Custom databases:)
-Default NCBI assembly or sequence accession:)
-Besides the automated download and build (ganon build
) ganon provides a highly customizable build procedure (ganon build-custom
) to create databases from local sequence files.
To use custom sequences, just provide them with --input
. ganon will try to retrieve all information necessary to build a database.
Note
-ganon expects assembly accessions in the filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz
. When using --input-target sequence
filenames are not important but sequence headers should contain sequence accessions like >CP022124.1 Fusobacterium nu...
. More information about building by file or sequence can be found here.
Non-standard/custom accessions:)
-It is also possible to use non-standard accessions and headers to build custom databases with --input-file
. This file should contain the following fields (tab-separated): file [<tab> target <tab> node <tab> specialization <tab> specialization_name].
Note that file is mandatory and the additional fields are optional.
Tip
-If you just want to build a database without any taxonomic or target information, just send the files with --input
, use --taxonomy skip
and choose between --input-target file
or sequence
.
Besides the automated download and build (ganon build
) ganon provides a highly customizable build procedure (ganon build-custom
) to create databases from local sequence files. The usage of this procedure depends on the configuration of your files:
-
+
- Filename like
GCA_002211645.1_ASM221164v1_genomic.fna.gz
: genomic FASTA files in the NCBI standard, with the assembly accession at the beginning of the filename. Provide the files with the --input
parameter. ganon will try to retrieve all necessary information to build the database.
+ - Headers like
>NC_006297.1 Bacteroides fragilis YCH46 ...
: sequence headers are in the NCBI standard, with the sequence accession right after >
and followed by a space (or line break). Provide the files with the --input
parameter and set --input-target sequence
. ganon will try to retrieve all necessary information to build the database.
+ - For non-standard filenames and headers, follow this +
Warning
+--input-target sequence
will be slower to build and will use more disk space, since files have to be re-written separately for each sequence. More information about building by file or sequence can be found here.
The --level
is an important parameter that defines the (max.) classification level for the database (more info):
-
+
--level file
or sequence
-> default behavior (depending on --input-target
), use file/sequence as classification target
+--level assembly
-> will retrieve assembly related to the file/sequence, use assembly as classification target
+--level leaves
or species
,genus
,... -> group input by taxonomy, use tax. nodes at the rank chosen as classification target
+
More information about other parameters here.
+Non-standard files/headers with --input-file
:)
+Alternatively to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file
. This file should contain the following fields (tab-separated):
file [<tab> target <tab> node <tab> specialization <tab> specialization_name].
-
+
file
: relative or full path to the sequence file
+target
: any unique text to name the file, to be used in the taxonomy
+node
: taxonomic node (e.g. taxid) to link entry with taxonomy
+specialization
: creates a specialized taxonomic level with a custom name, allowing files to be grouped
+specialization_name
: a name for the specialization, to be used in the taxonomy
+
Warning
-the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col)
+the target
and specialization
fields (2nd and 4th col) cannot be the same as the node
(3rd col)
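The field rules above can be checked before starting a build. Below is a minimal sketch, assuming a helper named `check_input_file` (hypothetical, not part of ganon) and two made-up TSV files. It only verifies that each line has 1 to 5 tab-separated fields with a non-empty `file`, and that `target` and `specialization` differ from `node`.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a ganon --input-file before building.
# check_input_file is a hypothetical helper, not a ganon command.
check_input_file() {
  awk -F'\t' '
    NF < 1 || NF > 5 || $1 == "" {
      printf "line %d: expected 1 to 5 tab-separated fields, file first\n", NR
      bad = 1; next
    }
    # Appending "" forces a string comparison, so 0562 and 562 stay different
    NF >= 3 && ($2 "") == ($3 "") { printf "line %d: target equals node (%s)\n", NR, $3; bad = 1 }
    NF >= 4 && ($4 "") == ($3 "") { printf "line %d: specialization equals node (%s)\n", NR, $3; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Made-up example files: one valid, one breaking the target/node rule
tmp="$(mktemp -d)"
printf 'sequences.fasta\tsequences\t562\tID44444\tEscherichia coli TW10119\n' > "${tmp}/ok.tsv"
printf 'others.fasta\t623\t623\n' > "${tmp}/bad.tsv"

check_input_file "${tmp}/ok.tsv" && echo "ok.tsv: valid"
check_input_file "${tmp}/bad.tsv" || echo "bad.tsv: invalid"
```

The function exits with a non-zero status when any line fails, so it can gate a later `ganon build-custom --input-file` call in a script.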
Examples of --input-file
-- -With --input-target file (default), where my_target_1 and my_target_2 are just names to assign sequences from (unique) sequence files: - - -
sequences.fasta my_target_1
-others.fasta my_target_2
+Below you will find examples of --input-file
. Note they are slightly different depending on the --input-target
chosen. They need to be tab-separated to be properly parsed (tsv).
+Examples of --input-file
using the default --input-target file
:)
+List of files:)
+sequences.fasta
+others.fasta
-
-
-With --input-target sequence, second column should match sequence headers on provided sequence files (that should be repeated for each header):
-
-
-sequences.fasta HEADER1
-sequences.fasta HEADER2
-sequences.fasta HEADER3
-others.fasta HEADER4
-others.fasta HEADER5
+No taxonomic information is provided so --taxonomy skip
should be set. The classification against the generated database will be performed at file level (--level file
), since that is the only available information given.
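A one-column list like the one above can be generated from a folder instead of being typed by hand. The sketch below assumes a made-up `genomes/` folder with two toy FASTA files. The final `ganon build-custom` call is only printed, not executed, and `mydb` is a placeholder prefix.

```shell
#!/usr/bin/env bash
# Sketch: write a one-column --input-file from a folder of FASTA files.
# Folder layout and file contents are made up for the example.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/genomes"
printf '>seq1\nACGT\n' > "${workdir}/genomes/sequences.fasta"
printf '>seq2\nTTGA\n' > "${workdir}/genomes/others.fasta"

# One file path per line; no target or node columns
find "${workdir}/genomes" -name '*.fasta' | sort > "${workdir}/files.tsv"
cat "${workdir}/files.tsv"

# Build at file level without taxonomy (printed here, not executed)
echo ganon build-custom --input-file "${workdir}/files.tsv" --db-prefix mydb --taxonomy skip --level file
```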
+List of files with alternative names:)
+sequences.fasta sequences
+others.fasta others
-
-
-A third column with taxonomic nodes can be provided to link the data with taxonomy. For example with --taxonomy ncbi:
-
-
-sequences.fasta FILE_A 562
-others.fasta FILE_B 623
+Just like above, but with a specific name to be used for each file.
+Files and taxonomy:)
+sequences.fasta sequences 562
+others.fasta others 623
-
-
-
-sequences.fasta HEADER1 562
-sequences.fasta HEADER2 562
-sequences.fasta HEADER3 562
-others.fasta HEADER4 623
-others.fasta HEADER5 623
+The classification max. level against this database will depend on the value set for --level
:
+
+--level file
-> use the file (named with target) with node as parent
+--level leaves
or species
,genus
,... -> files are grouped by taxonomy
+
+Files, taxonomy and specialization:)
+sequences.fasta sequences 562 ID44444 Escherichia coli TW10119
+others.fasta others 623 ID55555 Shigella flexneri 1a
-
-
-Further specializations can be used to create a additional classification level after the taxonomic leaves. For example (using --level custom):
-
-
-sequences.fasta FILE_A 562 ID44444 Escherichia coli TW10119
-others.fasta FILE_B 623 ID55555 Shigella flexneri 1a
+The classification max. level against this database will depend on the value set for --level
:
+
+--level custom
-> use the specialization (named with specialization_name) with node as parent
+--level file
-> use the file (named with target) with node as parent
+--level leaves
or species
,genus
,... -> files are grouped by taxonomy
+
+Examples of --input-file
using --input-target sequence
:)
+To provide tabular information for every sequence in your files, you need to use the target
field (2nd col.) of the --input-file
to input sequence headers. For example:
+Sequences and taxonomy:)
+sequences.fasta NZ_CP054001.1 562
+sequences.fasta NZ_CP117955.1 623
+others.fasta header1 666
+others.fasta header2 666
-
-
-
-sequences.fasta HEADER1 562 ID443 Escherichia coli TW10119
-sequences.fasta HEADER2 562 ID297 Escherichia coli PCN079
-sequences.fasta HEADER3 562 ID8873 Escherichia coli P0301867.7
-others.fasta HEADER4 623 ID2241 Shigella flexneri 1a
-others.fasta HEADER5 623 ID4422 Shigella flexneri 1b
+The classification max. level against this database will depend on the value set for --level
:
+
+--level sequence
-> use the sequence header with node as parent
+--level assembly
-> will attempt to retrieve the assembly related to the sequence with node as parent
+--level leaves
or species
,genus
,... -> files are grouped by taxonomy
+
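Rows for `--input-target sequence` can be derived from the FASTA headers themselves. The sketch below assumes a hypothetical helper `make_rows` and toy files. It assigns one node per file, which is a simplification: the example above links sequences from the same file to different nodes, and that would need a per-sequence mapping instead.

```shell
#!/usr/bin/env bash
# Sketch: derive file <tab> header <tab> node rows from FASTA headers.
# File contents and the file-to-node mapping are hand-written assumptions.
workdir="$(mktemp -d)"
cd "${workdir}" || exit 1
printf '>NZ_CP054001.1 first record\nACGT\n>NZ_CP117955.1 second record\nGGCC\n' > sequences.fasta
printf '>header1\nTTAA\n>header2\nCCGG\n' > others.fasta

make_rows() { # usage: make_rows <fasta> <node>
  awk -v file="$1" -v node="$2" '
    # Header text between ">" and the first space becomes the target
    /^>/ { id = substr($1, 2); print file "\t" id "\t" node }
  ' "$1"
}

{ make_rows sequences.fasta 562; make_rows others.fasta 666; } > seqs.tsv
cat seqs.tsv
```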
+Sequences, taxonomy and specialization:)
+sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119
+sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a
+others.fasta header1 666 StrainA My Strain
+others.fasta header2 666 StrainA My Strain
-
-
-
The classification max. level against this database will depend on the value set for --level
:
-
+
--level custom
-> use the specialization (named with specialization_name) with node as parent
+--level sequence
-> use the sequence header with node as parent
+--level leaves
or species
,genus
,... -> files are grouped by taxonomy
+
Examples:)
-Some examples with download and build commands for custom ganon databases from useful and commonly used repositories and datasets for metagenomics analysis:
+Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom
:
HumGut:)
Collection of >30000 genomes from healthy human metagenomes. Article/Website.
# Download sequence files
@@ -252,27 +284,19 @@ Plasmid, Plastid and Mito
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/"
-# Split sequences in files and retrieve taxonomy
-mkdir sequences/
-zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ ">" {accver=(substr($1,2)); print accver}{print $0 > "sequences/"accver".fna"}' | ganon-get-seq-info.sh -e -i - | awk '{print "sequences/"$1".fna\t"$1"\t"$3}' > ppm.tsv
-
-# Build ganon database
-ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 16
-
-# OPTIONAL Remove temporary folder and downloaded files
-rm -rf sequences/ ppm.tsv plasmid.* plastid.* mitochondrion.*
+ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence
UniVec, UniVec_core:)
"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process." Website. Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources.
# UniVec
wget -O "UniVec.fasta" --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec"
-echo -e "UniVec.fasta\tUniVec\t81077" > UniVec_Core_ganon_input_file.tsv
-ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8
+echo -e "UniVec.fasta\tUniVec\t81077" > UniVec_ganon_input_file.tsv
+ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size
# UniVec_Core
wget -O "UniVec_Core.fasta" --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core"
echo -e "UniVec_Core.fasta\tUniVec_Core\t81077" > UniVec_Core_ganon_input_file.tsv
-ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8
+ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size
Note
@@ -280,9 +304,11 @@ UniVec, UniVec_core
MGnify genome catalogues (MAGs):)
"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes". Article/Website/FTP.
-There are currently (2023-05-04) 8 genome catalogues available: chicken-gut, human-gut, human-oral, marine, non-model-fish-gut, pig-gut and zebrafish-fecal. An example below how to download and build the human-oral catalog:
+Currently available genome catalogues (2024-02-09): chicken-gut
cow-rumen
honeybee-gut
human-gut
human-oral
human-vaginal
marine
mouse-gut
non-model-fish-gut
pig-gut
zebrafish-fecal
List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/
Example on how to download and build the human-oral
catalog:
# Download metadata
-wget "https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0/genomes-all_metadata.tsv"
+wget "https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv"
# Download sequence files with 12 threads
tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e "1,/##FASTA/ d" | gzip > ${0}.fna.gz'
@@ -291,7 +317,7 @@ MGnify genome catalogues (MAGs)
Note
@@ -313,16 +339,13 @@ Pathogen detection FDA-ARGOS
BLAST databases (nt env_nt nt_prok ...):)
-Current available nucleotide databases (2023-05-04): 16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
Betacoronavirus
env_nt
human_genome
ITS_eukaryote_sequences
ITS_RefSeq_Fungi
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
mito
mouse_genome
nt
nt_euk
nt_others
nt_prok
nt_viruses
patnt
pdbnt
ref_euk_rep_genomes
ref_prok_rep_genomes
refseq_rna
refseq_select_rna
ref_viroids_rep_genomes
ref_viruses_rep_genomes
SSU_eukaryote_rRNA
tsa_nt
-
-Note
-List currently available nucleotide databases curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "nucl-metadata.json" | sed 's/-nucl-metadata.json/, /g' | sort
-
+Current available nucleotide databases (2024-02-09): 16S_ribosomal_RNA
18S_fungal_sequences
28S_fungal_sequences
Betacoronavirus
env_nt
human_genome
ITS_eukaryote_sequences
ITS_RefSeq_Fungi
LSU_eukaryote_rRNA
LSU_prokaryote_rRNA
mito
mouse_genome
nt
nt_euk
nt_others
nt_prok
nt_viruses
patnt
pdbnt
ref_euk_rep_genomes
ref_prok_rep_genomes
refseq_rna
refseq_select_rna
ref_viroids_rep_genomes
ref_viruses_rep_genomes
SSU_eukaryote_rRNA
tsa_nt
+List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "nucl-metadata.json" | sed 's/-nucl-metadata.json/, /g' | sort
Warning
-Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies
+Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies.
-The example below extracts sequences and information from a BLAST db to build a ganon database:
+The example below shows how to download, parse and build a ganon database from BLAST database files. It does so by splitting the database into taxon-specific files to speed up the build process:
# Define BLAST db
db="16S_ribosomal_RNA"
threads=8
@@ -342,18 +365,18 @@ BLAST databases (nt env_nt nt_prok ..
seq 0 9 | xargs -i mkdir -p "${db}"/{}
# This command extracts sequences from the blastdb and writes them into taxid specific files
-# It also generates the --input-file for ganon
+# It also generates the --input-file for ganon with the fields: filepath <tab> file <tab> taxid
blastdbcmd -entry all -db "${db}" -outfmt "%a %T %s" | \
-awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1"\n"$3 >> file; print file"\t"$2"\t"$2}' | \
+awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1"\n"$3 >> file; print file"\t"$2".fna\t"$2}' | \
sort | uniq > "${db}_ganon_input_file.tsv"
# Build ganon database
-ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --threads 12
+ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --threads ${threads} --level leaves
# Delete extracted files and auxiliary files
cat "${db}_extracted_files.txt" | xargs rm
rm "${db}_extracted_files.txt" "${db}.md5" "${db}_downloaded.md5"
-# Delete sequences
+# Delete sequences and input_file
rm -rf "${db}" "${db}_ganon_input_file.tsv"
@@ -361,7 +384,7 @@ BLAST databases (nt env_nt nt_prok ..
blastdbcmd
is a command from BLAST+ software suite (tested version 2.14.0) and should be installed separately.
Files from genome_updater:)
-To create a ganon database from files previosly downloaded with genome_updater:
+To create a ganon database from files previously downloaded with genome_updater:
ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32
Parameter details:)
@@ -371,11 +394,14 @@ False positive and size (--m
minimizers (--window-size, --kmer-size):)
in ganon build
, when --window-size
> --kmer-size
minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size
= --kmer-size
, all k-mers are going to be used to build the database.
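The effect of the window size can be seen on a toy sequence. The sketch below is a simplification under stated assumptions: it picks the lexicographically smallest k-mer in each window, whereas real minimizer implementations typically rank k-mers by a hash value. The function name `minimizers` and the sequence are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy minimizer selection: the smallest k-mer (lexicographic order) per window.
# Lexicographic order stands in for the hash-based ordering used in practice.
minimizers() { # usage: minimizers <sequence> <window-size> <kmer-size>
  awk -v seq="$1" -v w="$2" -v k="$3" 'BEGIN {
    for (i = 1; i + w - 1 <= length(seq); i++) {      # slide the window over the sequence
      best = substr(seq, i, k)
      for (j = i + 1; j + k - 1 <= i + w - 1; j++) {  # every other k-mer inside the window
        kmer = substr(seq, j, k)
        if (kmer < best) best = kmer
      }
      if (!(best in seen)) { seen[best] = 1; print best }  # report each selected k-mer once
    }
  }'
}

minimizers ACGTACGT 5 3   # window-size > kmer-size: prints ACG and CGT
minimizers ACGTACGT 3 3   # window-size = kmer-size: prints ACG, CGT, GTA and TAC
```

With a window of 5 only two of the four distinct 3-mers are kept, which is why minimizers shrink the database.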
Target file or sequence (--input-target):)
-Customized builds can be done either by file or sequence. --input-target file
will consider every file provided with --input
a single unit. --input-target sequence
will use every sequence as a unit.
---input-target file
is the default behavior and most efficient way to build databases. --input-target sequence
should only be used when the input sequences are stored in a single file or when classification at sequence level is desired.
+This is a parameter that defines how ganon will parse your input files:
+ - --input-target file
(default) will consider every file provided with --input
a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored).
+ - --input-target sequence
will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input
into a separate file. This will take longer and use more disk space.
+--input-target file
is the default behavior and most efficient way to build databases. --input-target sequence
should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired.
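To get a feeling for the difference, the number of units each mode would produce can be counted before building. This is a rough sketch on toy files with made-up names. It only counts files and header lines and does not call ganon.

```shell
#!/usr/bin/env bash
# Sketch: count the units each --input-target value would yield for toy files.
workdir="$(mktemp -d)"
printf '>s1\nACGT\n>s2\nGGCC\n' > "${workdir}/a.fna"
printf '>s3\nTTAA\n' > "${workdir}/b.fna"

n_files=$(ls "${workdir}"/*.fna | wc -l)          # --input-target file: one unit per file
n_seqs=$(cat "${workdir}"/*.fna | grep -c '^>')   # --input-target sequence: one unit per header

# $(( )) also strips the padding some wc implementations add
echo "file targets: $((n_files)), sequence targets: $((n_seqs))"
```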
Build level (--level):)
-The --level
parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp
is going to be guaranteed at the --level
chosen. By default, the level will be the same as --input-target
, meaning that classification will be done either at file or sequence level.
-Alternatively, --level assembly
will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves
or --level species
(or genus, family, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom
will use specialization level define in the --input-file
.
+The --level
parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp
is going to be guaranteed at the --level
chosen.
+In ganon build
the default value is species
. In ganon build-custom
the level will be the same as --input-target
, meaning that classification will be done either at file
or sequence
level.
+Alternatively, --level assembly
will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves
or --level species
(or genus
, family
, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom
will use specialization (4th col.) defined in the --input-file
.
Genome sizes (--genome-size-files):)
Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi
the species_genome_size.txt.gz is used. For --taxonomy gtdb
the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files
argument.
Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without direct assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database.
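The averaging rule for parent nodes can be illustrated with a few lines of awk. This is a sketch with invented node names and sizes, covering only direct children. The fallback to the closest parent for nodes without an assigned size is not shown, and `parent_sizes` is a hypothetical helper, not ganon code.

```shell
#!/usr/bin/env bash
# Sketch: a parent's genome size as the mean of its children's sizes.
# Input rows: parent <tab> child <tab> genome size (all values invented).
parent_sizes() {
  awk -F'\t' '
    { sum[$1] += $3; n[$1]++ }
    END { for (p in sum) printf "%s\t%d\n", p, sum[p] / n[p] }
  ' | sort
}

printf 'genusA\tspecies1\t5000000\ngenusA\tspecies2\t4600000\ngenusB\tspecies3\t2000000\n' |
  parent_sizes
```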
diff --git a/default_databases/index.html b/default_databases/index.html
index bcda89b5..0c327515 100644
--- a/default_databases/index.html
+++ b/default_databases/index.html
@@ -143,7 +143,7 @@ Databases
ganon update -d arc_bac -t 30
-Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom
command.
Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom
command.
Info
We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get most out of your data.
@@ -293,7 +293,7 @@ More filter options
repository.
GTDB:)
-
By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb
.
By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb
.
Filtering by taxonomic entries also works with GTDB, for example:
ganon build --db-prefix fuso_gtdb --taxid "f__Fusobacteriaceae" --source refseq genbank --taxonomy gtdb --threads 12
@@ -315,7 +315,7 @@ Reproducibility
Reducing database size:)
Filter type (IBF and HIBF):)
-The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times (article). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type
parameter in ganon build
and ganon build-custom
-
The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times (article). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type
parameter in ganon build
and ganon build-custom
.
Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% (--filter-type hibf --max-fp 0.001
).
Hint
diff --git a/index.html b/index.html
index 909cf897..345999c3 100644
--- a/index.html
+++ b/index.html
@@ -146,7 +146,7 @@ Features
EM and/or LCA algorithms to solve multiple-matching reads
ganon achieved very good results in our own evaluations but also in independent evaluations: LEMMI, LEMMI v2 and CAMI2
@@ -159,17 +159,20 @@ Installation from source
Python dependencies:)
- python >=3.6
-- pandas >=1.1.0
+- pandas >=1.2.0
- multitax >=1.3.1
+- genome_updater >=0.6.3
# Python version should be >=3.6
python3 -V
# Install packages via pip or conda:
# PIP
-python3 -m pip install "pandas>=1.1.0" "multitax>=1.3.1"
-# Conda (alternative)
-conda install "pandas>=1.1.0" "multitax>=1.3.1"
+python3 -m pip install "pandas>=1.2.0" "multitax>=1.3.1"
+wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh && chmod +x genome_updater.sh
+
+# Conda/Mamba (alternative)
+conda install -c bioconda -c conda-forge "pandas>=1.2.0" "multitax>=1.3.1" "genome_updater>=0.6.3"
C++ dependencies:)
@@ -177,7 +180,7 @@ C++ dependencies
Tip
@@ -204,7 +207,7 @@ Downloading and building gano
- to classify extremely large reads or contigs that would need more than 65000 k-mers, use
-DLONGREADS=ON
Installing raptor:)
-
The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge "raptor>=3.0.1"
(already included in ganon install via conda).
The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge "raptor=3.0.1"
(already included in ganon install via conda).
Note
raptor is required to build databases with the Hierarchical Interleaved Bloom Filter (ganon build --filter-type hibf
)
@@ -234,7 +237,8 @@ Testing
ganon -h
Running tests:)
-python3 -m unittest discover -s tests/ganon/integration/
+python3 -m pip install "parameterized>=0.9.0" # Alternative: conda install -c conda-forge "parameterized>=0.9.0"
+python3 -m unittest discover -s tests/ganon/integration/
python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files
cd build/
ctest -VV .
@@ -246,7 +250,7 @@ ParametersParametersParameters=3.6 pandas >=1.1.0 multitax >=1.3.1 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.1.0\" \"multitax>=1.3.1\" # Conda (alternative) conda install \"pandas>=1.1.0\" \"multitax>=1.3.1\" C++ dependencies :) GCC >=11 CMake >=3.4 zlib bzip2 raptor >=3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" changing /path/to/miniconda3 with your local path to the conda installation. Downloading and building ganon + submodules :) git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON Installing raptor :) The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor>=3.0.1\" (already included in ganon install via conda). 
Note raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ) To build old style ganon indices ganon build --filter-type ibf , raptor is not required To install raptor from source, follow the instructions below: Dependencies :) CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version) Downloading and building raptor + submodules :) git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory you may have to inform ganon build the path to the binaries with --raptor-path raptor/build/bin Testing :) If everything was properly installed, the following command should show the help pages without errors: ganon -h Running tests :) python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV . Parameters :) usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 2.0.1 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. 
","title":"ganon2"},{"location":"#ganon","text":"Code: GitHub repository ganon2 pre-print ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was mainly developed, but not limited, to the metagenomics classification problem: quickly assign sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported.","title":"ganon"},{"location":"#features","text":"integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all customizable database build for local or non-standard sequence files optimized taxonomic binning and profiling configurations build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization hierarchical classification using several databases in one or more levels in just one run EM and/or LCA algorithms to solve multiple-matching reads reporting of multiple and unique matches for every read reporting of sequence, taxonomic or multi-match abundances with optional genome size correction advanced tree-like reports with several filter options generation of contingency tables with several filters for multi-sample studies ganon achieved very good results in our own evaluations but also in independent evaluations: LEMMI , LEMMI v2 and CAMI2","title":"Features"},{"location":"#installation-with-conda","text":"The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are 
possible performance benefits to compiling ganon from source on the target machine rather than using the conda version. To do so, please follow the instructions below:","title":"Installation with conda"},{"location":"#installation-from-source","text":"","title":"Installation from source"},{"location":"#python-dependencies","text":"python >=3.6 pandas >=1.1.0 multitax >=1.3.1 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.1.0\" \"multitax>=1.3.1\" # Conda (alternative) conda install \"pandas>=1.1.0\" \"multitax>=1.3.1\"","title":"Python dependencies"},{"location":"#c-dependencies","text":"GCC >=11 CMake >=3.4 zlib bzip2 raptor >=3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have to set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" replacing /path/to/miniconda3 with your local path to the conda installation.","title":"C++ dependencies"},{"location":"#downloading-and-building-ganon-submodules","text":"git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. 
to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON","title":"Downloading and building ganon + submodules"},{"location":"#installing-raptor","text":"The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor>=3.0.1\" (already included when ganon is installed via conda). Note: raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ). To build old-style ganon indices ( ganon build --filter-type ibf ), raptor is not required. To install raptor from source, follow the instructions below:","title":"Installing raptor"},{"location":"#dependencies","text":"CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version)","title":"Dependencies"},{"location":"#downloading-and-building-raptor-submodules","text":"git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory; you may have to tell ganon build where the binaries are with --raptor-path raptor/build/bin","title":"Downloading and building raptor + submodules"},{"location":"#testing","text":"If everything was properly installed, the following command should show the help pages without errors: ganon -h","title":"Testing"},{"location":"#running-tests","text":"python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV .","title":"Running tests"},{"location":"#parameters","text":"usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 
2.0.1 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain-based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxon. 0 for all. 
(default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter-type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller than the value defined. 0 to not skip any sequence. Only valid for --filter-type ibf. 
(default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Manually set information for input files: file [target node specialization specialization name]. target is the sequence identifier if --input-target sequence (file can be repeated for multiple sequences). if --input-target file and target is not set, filename is used. node is the taxonomic identifier. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. 
By default: 'file' if multiple input files are provided or --input-file is set, 'sequence' if a single file is provided. Using 'file' is recommended and will speed-up the building process (default: None) -l , --level Use a specialized target to build the database. By default, --level is the --input-target. Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. [refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. 
Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter-type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller than the value defined. 0 to not skip any sequence. Only valid for --filter-type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. 
(default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] 
Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). 
Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] 
[-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. 
abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] 
Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. 
May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: [])","title":"Parameters"},{"location":"classification/","text":"Classification :) ganon classify will match single and/or paired-end sets of reads against one or more databases . By default, parameters are optimized for taxonomic profiling , meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep : plain report of the run, used to further generate tree-like reports results.tre : tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report ) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all . More information about output files here . Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application. 
Profiling :) ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here ) Binning :) To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning , which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to them --report-type reads will report sequence abundances instead of taxonomic abundances (more information here ) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23 , 27 ) will improve sensitivity of your results at the cost of a larger database and memory usage. Reads with multiple matches :) There are three ways to handle reads with multiple matches in ganon classify : --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to the most probable target (defined by --level in the build procedure). --multiple-matches lca : uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip : will not resolve multi-matching reads Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all . Reports can be generated independently with ganon report using the output file .rep . Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to the most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. 
In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped. Classifying more reads :) By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0 , respectively). More details here . Multiple and Hierarchical classification :) ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix . The databases must be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff ) while others are set for each hierarchical level (e.g. --rel-filter ). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. 
`--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz Parameter details :) reads (--single-reads, --paired-reads) :) ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (counting both reads in paired mode). cutoff and filter (--rel-cutoff, --rel-filter) :) ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter . Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control what percentage of additional matches (if any) should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the number of unique or multi-matched reads. 
cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 20.5, rounded up to 21 (ceiling is applied). Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top of the remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 53.5, rounded up to 54 (ceiling is applied). ref1 and ref2 are reported as matches. Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match. False positive of a query (--fpr-query) :) ganon uses Bloom Filters, probabilistic data structures that may return false positive results. 
The base false positive rate of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive rate for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by Solomon and Kingsford (2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , ensuring that the false positive rate stays under control. In this case, however, the sensitivity of results may decrease. Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"Classification (ganon classify)"},{"location":"classification/#classification","text":"ganon classify will match single and/or paired-end sets of reads against one or more databases . By default, parameters are optimized for taxonomic profiling , meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep : plain report of the run, used to further generate tree-like reports results.tre : tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report ) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all . More information about output files here . 
Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application.","title":"Classification"},{"location":"classification/#profiling","text":"ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here )","title":"Profiling"},{"location":"classification/#binning","text":"To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning , which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to them --report-type reads will report sequence abundances instead of taxonomic abundances (more information here ) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23 , 27 ) will improve sensitivity of your results at the cost of a larger database and memory usage.","title":"Binning"},{"location":"classification/#reads-with-multiple-matches","text":"There are three ways to handle reads with multiple matches in ganon classify : --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to the most probable target (defined by --level in the build procedure). --multiple-matches lca : uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip : will not resolve multi-matching reads Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all . 
Reports can be generated independently with ganon report using the output file .rep . Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to the most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped.","title":"Reads with multiple matches"},{"location":"classification/#classifying-more-reads","text":"By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0 , respectively). More details here .","title":"Classifying more reads"},{"location":"classification/#multiple-and-hierarchical-classification","text":"ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix . The databases must be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff ) while others are set for each hierarchical level (e.g. 
--rel-filter ). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. `--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz","title":"Multiple and Hierarchical classification"},{"location":"classification/#parameter-details","text":"","title":"Parameter details"},{"location":"classification/#reads-single-reads-paired-reads","text":"ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (counting both reads in paired mode).","title":"reads (--single-reads, --paired-reads)"},{"location":"classification/#cutoff-and-filter-rel-cutoff-rel-filter","text":"ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter . Every read can be classified against none, one or more references. 
ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control what percentage of additional matches (if any) should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the number of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 20.5, rounded up to 21 (ceiling is applied). 
Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top of the remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 53.5, rounded up to 54 (ceiling is applied). ref1 and ref2 are reported as matches. Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match.","title":"cutoff and filter (--rel-cutoff, --rel-filter)"},{"location":"classification/#false-positive-of-a-query-fpr-query","text":"ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive rate of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive rate for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by Solomon and Kingsford (2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , ensuring that the false positive rate stays under control. 
In this case, however, the sensitivity of the results may decrease. Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"False positive of a query (--fpr-query)"},{"location":"custom_databases/","text":"Custom databases :) Default NCBI assembly or sequence accession :) Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. To use custom sequences, just provide them with --input . ganon will try to retrieve all information necessary to build a database. Note ganon expects assembly accessions in the filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz . When using --input-target sequence , filenames are not important but sequence headers should contain sequence accessions like >CP022124.1 Fusobacterium nu... . More information about building by file or sequence can be found here . Non-standard/custom accessions :) It is also possible to use non-standard accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. Note that file is mandatory and the additional fields are optional. Tip If you just want to build a database without any taxonomic or target information, just send the files with --input , use --taxonomy skip and choose between --input-target file or sequence . 
Warning The target and specialization fields (2nd and 4th columns) cannot be the same as the node (3rd column). Examples of --input-file With --input-target file (default), where my_target_1 and my_target_2 are just names to assign sequences from (unique) sequence files: sequences.fasta my_target_1 others.fasta my_target_2 With --input-target sequence, the second column should match the sequence headers in the provided sequence files (the file is repeated for each header): sequences.fasta HEADER1 sequences.fasta HEADER2 sequences.fasta HEADER3 others.fasta HEADER4 others.fasta HEADER5 A third column with taxonomic nodes can be provided to link the data with taxonomy. For example with --taxonomy ncbi: sequences.fasta FILE_A 562 others.fasta FILE_B 623 sequences.fasta HEADER1 562 sequences.fasta HEADER2 562 sequences.fasta HEADER3 562 others.fasta HEADER4 623 others.fasta HEADER5 623 Further specializations can be used to create an additional classification level after the taxonomic leaves. For example (using --level custom): sequences.fasta FILE_A 562 ID44444 Escherichia coli TW10119 others.fasta FILE_B 623 ID55555 Shigella flexneri 1a sequences.fasta HEADER1 562 ID443 Escherichia coli TW10119 sequences.fasta HEADER2 562 ID297 Escherichia coli PCN079 sequences.fasta HEADER3 562 ID8873 Escherichia coli P0301867.7 others.fasta HEADER4 623 ID2241 Shigella flexneri 1a others.fasta HEADER5 623 ID4422 Shigella flexneri 1b Examples :) Some examples with download and build commands for custom ganon databases from useful and commonly used repositories and datasets for metagenomics analysis: HumGut :) Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available Plasmid, Plastid and Mitochondrion from RefSeq :) Extra repositories from RefSeq release not included as default databases. Website . 
# Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" # Split sequences in files and retrieve taxonomy mkdir sequences/ zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ \">\" {accver=(substr($1,2)); print accver}{print $0 > \"sequences/\"accver\".fna\"}' | ganon-get-seq-info.sh -e -i - | awk '{print \"sequences/\"$1\".fna\\t\"$1\"\\t\"$3}' > ppm.tsv # Build ganon database ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 16 # OPTIONAL Remove temporary folder and downloaded files rm -rf sequences/ ppm.tsv plasmid.* plastid.* mitochondrion.* UniVec, UniVec_core :) \"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_Core is a subset of UniVec selected to reduce false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). 
Some are completely artificial but others may be derived from real biological sources. More information in this link . MGnify genome catalogues (MAGs) :) \"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . There are currently (2023-05-04) 8 genome catalogues available: chicken-gut, human-gut, human-oral, marine, non-model-fish-gut, pig-gut and zebrafish-fecal. The example below shows how to download and build the human-oral catalogue: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --threads 32 Note MGnify genome catalogues are built with GTDB taxonomy. Pathogen detection FDA-ARGOS :) A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . 
# Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files BLAST databases (nt env_nt nt_prok ...) :) BLAST databases. Website / FTP . Current available nucleotide databases (2023-05-04): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt Note List currently available nucleotide databases curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. 
You may need to use some reduction strategies The example below extracts sequences and information from a BLAST db to build a ganon database: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\"\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads 12 # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a 
command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately. Files from genome_updater :) To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32 Parameter details :) False positive and size (--max-fp, --filter-size) :) ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches in classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so that any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be much lower, but is directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive of the index reported at the end of the building process. minimizers (--window-size, --kmer-size) :) In ganon build , when --window-size > --kmer-size minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database. Target file or sequence (--input-target) :) Customized builds can be done either by file or sequence. --input-target file will consider every file provided with --input a single unit. --input-target sequence will use every sequence as a unit. 
--input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are stored in a single file or when classification at sequence level is desired. Build level (--level) :) The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. By default, the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves or --level species (or genus, family, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization level defined in the --input-file . Genome sizes (--genome-size-files) :) Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database . Retrieving info (--ncbi-sequence-info, --ncbi-file-info) :) Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customizations on this step. 
With --input-target sequence , the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences; otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. With --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Custom databases (ganon build-custom)"},{"location":"custom_databases/#custom-databases","text":"","title":"Custom databases"},{"location":"custom_databases/#default-ncbi-assembly-or-sequence-accession","text":"Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. To use custom sequences, just provide them with --input . ganon will try to retrieve all information necessary to build a database. Note ganon expects assembly accessions in the filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz . When using --input-target sequence , filenames are not important but sequence headers should contain sequence accessions like >CP022124.1 Fusobacterium nu... . 
More information about building by file or sequence can be found here .","title":"Default NCBI assembly or sequence accession"},{"location":"custom_databases/#non-standardcustom-accessions","text":"It is also possible to use non-standard accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. Note that file is mandatory and the additional fields are optional. Tip If you just want to build a database without any taxonomic or target information, just send the files with --input , use --taxonomy skip and choose between --input-target file or sequence . Warning The target and specialization fields (2nd and 4th columns) cannot be the same as the node (3rd column). Examples of --input-file With --input-target file (default), where my_target_1 and my_target_2 are just names to assign sequences from (unique) sequence files: sequences.fasta my_target_1 others.fasta my_target_2 With --input-target sequence, the second column should match the sequence headers in the provided sequence files (the file is repeated for each header): sequences.fasta HEADER1 sequences.fasta HEADER2 sequences.fasta HEADER3 others.fasta HEADER4 others.fasta HEADER5 A third column with taxonomic nodes can be provided to link the data with taxonomy. For example with --taxonomy ncbi: sequences.fasta FILE_A 562 others.fasta FILE_B 623 sequences.fasta HEADER1 562 sequences.fasta HEADER2 562 sequences.fasta HEADER3 562 others.fasta HEADER4 623 others.fasta HEADER5 623 Further specializations can be used to create an additional classification level after the taxonomic leaves. 
For example (using --level custom): sequences.fasta FILE_A 562 ID44444 Escherichia coli TW10119 others.fasta FILE_B 623 ID55555 Shigella flexneri 1a sequences.fasta HEADER1 562 ID443 Escherichia coli TW10119 sequences.fasta HEADER2 562 ID297 Escherichia coli PCN079 sequences.fasta HEADER3 562 ID8873 Escherichia coli P0301867.7 others.fasta HEADER4 623 ID2241 Shigella flexneri 1a others.fasta HEADER5 623 ID4422 Shigella flexneri 1b","title":"Non-standard/custom accessions"},{"location":"custom_databases/#examples","text":"Some examples with download and build commands for custom ganon databases from useful and commonly used repositories and datasets for metagenomics analysis:","title":"Examples"},{"location":"custom_databases/#humgut","text":"Collection of >30000 genomes from healthy human metagenomes. Article / Website . # Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are 
available","title":"HumGut"},{"location":"custom_databases/#plasmid-plastid-and-mitochondrion-from-refseq","text":"Extra repositories from RefSeq release not included as default databases. Website . # Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" # Split sequences in files and retrieve taxonomy mkdir sequences/ zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ \">\" {accver=(substr($1,2)); print accver}{print $0 > \"sequences/\"accver\".fna\"}' | ganon-get-seq-info.sh -e -i - | awk '{print \"sequences/\"$1\".fna\\t\"$1\"\\t\"$3}' > ppm.tsv # Build ganon database ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 16 # OPTIONAL Remove temporary folder and downloaded files rm -rf sequences/ ppm.tsv plasmid.* plastid.* mitochondrion.*","title":"Plasmid, Plastid and Mitochondrion from RefSeq"},{"location":"custom_databases/#univec-univec_core","text":"\"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. 
# UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link .","title":"UniVec, UniVec_core"},{"location":"custom_databases/#mgnify-genome-catalogues-mags","text":"\"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . There are currently (2023-05-04) 8 genome catalogues available: chicken-gut, human-gut, human-oral, marine, non-model-fish-gut, pig-gut and zebrafish-fecal. 
The example below shows how to download and build the human-oral catalogue: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --threads 32 Note MGnify genome catalogues are built with GTDB taxonomy.","title":"MGnify genome catalogues (MAGs)"},{"location":"custom_databases/#pathogen-detection-fda-argos","text":"A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . # Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files","title":"Pathogen detection FDA-ARGOS"},{"location":"custom_databases/#blast-databases-nt-env_nt-nt_prok","text":"BLAST databases. Website / FTP . 
Current available nucleotide databases (2023-05-04): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt Note List currently available nucleotide databases curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies The example below extracts sequences and information from a BLAST db to build a ganon database: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences 
from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\"\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads 12 # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately.","title":"BLAST databases (nt env_nt nt_prok ...)"},{"location":"custom_databases/#files-from-genome_updater","text":"To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32","title":"Files from genome_updater"},{"location":"custom_databases/#parameter-details","text":"","title":"Parameter details"},{"location":"custom_databases/#false-positive-and-size-max-fp-filter-size","text":"ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches in classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so that any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be much lower, but is directly affected by this value. 
Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive of the index reported at the end of the building process.","title":"False positive and size (--max-fp, --filter-size)"},{"location":"custom_databases/#minimizers-window-size-kmer-size","text":"In ganon build , when --window-size > --kmer-size minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database.","title":"minimizers (--window-size, --kmer-size)"},{"location":"custom_databases/#target-file-or-sequence-input-target","text":"Customized builds can be done either by file or sequence. --input-target file will consider every file provided with --input a single unit. --input-target sequence will use every sequence as a unit. --input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are stored in a single file or when classification at sequence level is desired.","title":"Target file or sequence (--input-target)"},{"location":"custom_databases/#build-level-level","text":"The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. By default, the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves or --level species (or genus, family, ...) 
will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization level defined in the --input-file .","title":"Build level (--level)"},{"location":"custom_databases/#genome-sizes-genome-size-files","text":"Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database .","title":"Genome sizes (--genome-size-files)"},{"location":"custom_databases/#retrieving-info-ncbi-sequence-info-ncbi-file-info","text":"Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customizations on this step. With --input-target sequence , the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences; otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. With --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. 
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . eutils option does not work with outdated accessions.","title":"Retrieving info (--ncbi-sequence-info, --ncbi-file-info)"},{"location":"default_databases/","text":"Databases :) ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data. RefSeq and GenBank :) NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results. 
Commonly used sub-sets :) RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown. The commands provided will download up-to-date assemblies and will require slightly larger resources. 
GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * in GB -> ganon requires up-to 2x the database size of memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set as possible given your computational resources It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way combination of multiple databases are possible, extending use cases. Further examples of commonly used database can be found here . Specific organisms or taxonomic groups :) It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq. 
More filter options :) ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository . GTDB :) By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12 Update (ganon update) :) Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter. Reproducibility :) If you use ganon with default databases and want to re-generate it later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. 
To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12 Reducing database size :) Filter type (IBF and HIBF) :) The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF False positive rate :) A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. This can be filtered with the --fpr-query parameter in ganon classify k-mer and window size :) Define how much unique information is stored in the database. More details The smaller the --kmer-size , the less unique they will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored resulting in smaller databases but with decreased classification accuracy. Top assemblies :) RefSeq and GenBank are highly biased toward some few organisms. This means that some species are highly represented in number of assemblies compared to others. 
This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analysis are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and use the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus Split databases :) Ganon allows classification with multiple databases in one level or in an hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ). Fixed size and Mode (only for --filter-type ibf) :) A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refers to the false positive and not to the database size (which is fixed). Example :) Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configurations may be quite different","title":"Databases (ganon build)"},{"location":"default_databases/#databases","text":"ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will 
download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data.","title":"Databases"},{"location":"default_databases/#refseq-and-genbank","text":"NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.","title":"RefSeq and GenBank"},{"location":"default_databases/#commonly-used-sub-sets","text":"RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species 
Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown. The commands provided will download up-to-date assemblies and will require slightly larger resources. GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * in GB -> ganon requires up-to 2x the database size of memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set as possible given your computational resources It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . 
Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here .","title":"Commonly used sub-sets"},{"location":"default_databases/#specific-organisms-or-taxonomic-groups","text":"It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq.","title":"Specific organisms or taxonomic groups"},{"location":"default_databases/#more-filter-options","text":"ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository .","title":"More filter options"},{"location":"default_databases/#gtdb","text":"By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12","title":"GTDB"},{"location":"default_databases/#update-ganon-update","text":"Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. 
For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter.","title":"Update (ganon update)"},{"location":"default_databases/#reproducibility","text":"If you use ganon with default databases and want to re-generate it later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12","title":"Reproducibility"},{"location":"default_databases/#reducing-database-size","text":"","title":"Reducing database size"},{"location":"default_databases/#filter-type-ibf-and-hibf","text":"The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). 
Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF","title":"Filter type (IBF and HIBF)"},{"location":"default_databases/#false-positive-rate","text":"A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. This can be filtered with the --fpr-query parameter in ganon classify","title":"False positive rate"},{"location":"default_databases/#k-mer-and-window-size","text":"Define how much unique information is stored in the database. More details The smaller the --kmer-size , the less unique they will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored resulting in smaller databases but with decreased classification accuracy.","title":"k-mer and window size"},{"location":"default_databases/#top-assemblies","text":"RefSeq and GenBank are highly biased toward some few organisms. This means that some species are highly represented in number of assemblies compared to others. This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and use the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) 
ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus","title":"Top assemblies"},{"location":"default_databases/#split-databases","text":"Ganon allows classification with multiple databases in one level or in an hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ).","title":"Split databases"},{"location":"default_databases/#fixed-size-and-mode-only-for-filter-type-ibf","text":"A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refers to the false positive and not to the database size (which is fixed).","title":"Fixed size and Mode (only for --filter-type ibf)"},{"location":"default_databases/#example","text":"Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configuration may be quite different","title":"Example"},{"location":"outputfiles/","text":"Output files :) ganon build/build-custom/update :) Every run on ganon build , ganon build-custom or ganon update will generate the following database files: 
{prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be updated later. Otherwise it can be deleted. Warning Database files generated with version 1.2.0 or higher are not compatible with older versions. ganon classify :) {prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . At the end prints 2 extra lines with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count) ganon report :) {prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by redistributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. 
reads : sequence abundances , reports the proportion of sequences assigned to a taxa, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to their original target, including multiple and shared matches. Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. 
Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep . ganon table :) {output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"Output files"},{"location":"outputfiles/#output-files","text":"","title":"Output files"},{"location":"outputfiles/#ganon-buildbuild-customupdate","text":"Every run on ganon build , ganon build-custom or ganon update will generate the following database files: {prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be update later. Otherwise it can be deleted. 
Warning Database files generated with version 1.2.0 or higher are not compatible with older versions.","title":"ganon build/build-custom/update"},{"location":"outputfiles/#ganon-classify","text":"{prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . At the end prints 2 extra lines with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count)","title":"ganon classify"},{"location":"outputfiles/#ganon-report","text":"{prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. reads : sequence abundances , reports the proportion of sequences assigned to a taxon, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to its original target, including multiple and shared matches. 
Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. 
To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep .","title":"ganon report"},{"location":"outputfiles/#ganon-table","text":"{output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"ganon table"},{"location":"reports/","text":"Reports :) ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches. 
Examples :) Given the output .rep from ganon classify and the database used ( --db-prefix ): Taxonomic profile with abundance estimation (default) :) ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance Sequence profile :) ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads Matches profile :) ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches Filtering results :) ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant. Parameter details :) report type (--report-type) :) Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. The re-distribution applies for reads classified with a LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same of reads with read count re-distribution corr is the same of reads with correction by genome size matches will report the total number of matches classified, either unique or shared. 
This option will output the total number of matches instead of the total number of reads","title":"Reports (ganon report)"},{"location":"reports/#reports","text":"ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches.","title":"Reports"},{"location":"reports/#examples","text":"Given the output .rep from ganon classify and the database used ( --db-prefix ):","title":"Examples"},{"location":"reports/#taxonomic-profile-with-abundance-estimation-default","text":"ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance","title":"Taxonomic profile with abundance estimation (default)"},{"location":"reports/#sequence-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads","title":"Sequence profile"},{"location":"reports/#matches-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches","title":"Matches profile"},{"location":"reports/#filtering-results","text":"ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant.","title":"Filtering results"},{"location":"reports/#parameter-details","text":"","title":"Parameter details"},{"location":"reports/#report-type-report-type","text":"Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. 
The re-distribution applies for reads classified with a LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same of reads with read count re-distribution corr is the same of reads with correction by genome size matches will report the total number of matches classified, either unique or shared. This option will output the total number of matches instead the total number of reads","title":"report type (--report-type)"},{"location":"start/","text":"Quick Start Guide :) Install :) conda install -c bioconda -c conda-forge ganon Download and Build a database :) Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above. Classify and generate a tax. profile :) Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile Important parameters :) The most important parameters and trade-offs to be aware of when using ganon: ganon build :) --level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive of the bloom filters. 
The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, specially read binning, at a cost of larger databases. ganon classify :) --rel-cutoff : defines the min. percentage of k-mers shared to a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> less read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm. ganon report :) --report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead a hard cutoff. 
The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. If you are not sure which values to use or see something unexpected, please open an issue .","title":"Quick Start"},{"location":"start/#quick-start-guide","text":"","title":"Quick Start Guide"},{"location":"start/#install","text":"conda install -c bioconda -c conda-forge ganon","title":"Install"},{"location":"start/#download-and-build-a-database","text":"Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above.","title":"Download and Build a database"},{"location":"start/#classify-and-generate-a-tax-profile","text":"Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile","title":"Classify and generate a tax. profile"},{"location":"start/#important-parameters","text":"The most important parameters and trade-offs to be aware of when using ganon:","title":"Important parameters"},{"location":"start/#ganon-build","text":"--level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive of the bloom filters. The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. 
However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, especially read binning, at a cost of larger databases.","title":"ganon build"},{"location":"start/#ganon-classify","text":"--rel-cutoff : defines the min. percentage of k-mers shared with a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> fewer read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm.","title":"ganon classify"},{"location":"start/#ganon-report","text":"--report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead of a hard cutoff. The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. 
If you are not sure which values to use or see something unexpected, please open an issue .","title":"ganon report"},{"location":"table/","text":"Table :) ganon table filters and summarizes several reports obtained with ganon report into a table. Filters for each sample or for averages among all samples can also be applied. Examples :) Given several .tre from ganon report : Counts of species :) ganon table --input *.tre --output-file table.tsv --rank species Abundance of species :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species Top 10 species (among all samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10 Top 10 species (from each samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10 Filtering results :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Table (ganon table)"},{"location":"table/#table","text":"ganon table filters and summarizes several reports obtained with ganon report into a table. 
Filters for each sample or for averages among all samples can also be applied.","title":"Table"},{"location":"table/#examples","text":"Given several .tre from ganon report :","title":"Examples"},{"location":"table/#counts-of-species","text":"ganon table --input *.tre --output-file table.tsv --rank species","title":"Counts of species"},{"location":"table/#abundance-of-species","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species","title":"Abundance of species"},{"location":"table/#top-10-species-among-all-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10","title":"Top 10 species (among all samples)"},{"location":"table/#top-10-species-from-each-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10","title":"Top 10 species (from each samples)"},{"location":"table/#filtering-results","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Filtering results"},{"location":"tutorials/","text":"Tutorials :) ... soon ...","title":"Tutorials"},{"location":"tutorials/#tutorials","text":"... soon ...","title":"Tutorials"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"ganon :) Code: GitHub repository ganon2 pre-print ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was mainly developed, but not limited, to the metagenomics classification problem: quickly assign sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported. Features :) integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all customizable database build for local or non-standard sequence files optimized taxonomic binning and profiling configurations build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization hierarchical classification using several databases in one or more levels in just one run EM and/or LCA algorithms to solve multiple-matching reads reporting of multiple and unique matches for every read reporting of sequence, taxonomic or multi-match abundances with optional genome size correction advanced tree-like reports with several filter options generation of contingency tables with several filters for multi-sample studies ganon achieved very good results in our own evaluations but also in independent evaluations: LEMMI , LEMMI v2 and CAMI2 Installation with conda :) The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are possible performance benefits compiling ganon from source in the target machine rather than using the conda version. 
To do so, please follow the instructions below: Installation from source :) Python dependencies :) python >=3.6 pandas >=1.2.0 multitax >=1.3.1 genome_updater >=0.6.3 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.2.0\" \"multitax>=1.3.1\" wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh && chmod +x genome_updater.sh # Conda/Mamba (alternative) conda install -c bioconda -c conda-forge \"pandas>=1.2.0\" \"multitax>=1.3.1\" \"genome_updater>=0.6.3\" C++ dependencies :) GCC >=11 CMake >=3.4 zlib bzip2 raptor ==3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" changing /path/to/miniconda3 with your local path to the conda installation. Downloading and building ganon + submodules :) git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. 
to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON Installing raptor :) The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor=3.0.1\" (already included in ganon install via conda). Note raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ) To build old style ganon indices ganon build --filter-type ibf , raptor is not required To install raptor from source, follow the instructions below: Dependencies :) CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version) Downloading and building raptor + submodules :) git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory you may have to inform ganon build the path to the binaries with --raptor-path raptor/build/bin Testing :) If everything was properly installed, the following command should show the help pages without errors: ganon -h Running tests :) python3 -m pip install \"parameterized>=0.9.0\" # Alternative: conda install -c conda-forge \"parameterized>=0.9.0\" python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV . Parameters :) usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 
2.1.0 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. 
(default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. 
(default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Tab-separated file with all necessary file/sequence information. Fields: file [ target node specialization specialization name]. For details: https://pirovc.github.io/ganon/custom_databases/. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. Parse input by file or by sequence. Using 'file' is recommended and will speed-up the building process (default: file) -l , --level Max. level to build the database. By default, --level is the --input-target. 
Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. [refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. 
With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter-type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter-type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. 
{db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] 
Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). 
The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. 
In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. abundance (re-distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. 
Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). 
Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] 
Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: [])","title":"ganon2"},{"location":"#ganon","text":"Code: GitHub repository. ganon2 pre-print. ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was developed mainly for, but is not limited to, the metagenomics classification problem: quickly assigning sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported.","title":"ganon"},{"location":"#features","text":"integrated download and build of any subset from RefSeq/GenBank/GTDB with incremental updates; NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all; customizable database build for local or non-standard sequence files; optimized taxonomic binning and profiling configurations; build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization; hierarchical classification using several databases in one or more levels in just one run; EM and/or LCA algorithms to solve multiple-matching reads; reporting of multiple and unique matches for every read; reporting of sequence, taxonomic or multi-match abundances with optional genome size correction; advanced tree-like reports with several filter options; generation of contingency tables with several filters for multi-sample studies. ganon achieved very good results in our own evaluations and in independent evaluations: LEMMI, LEMMI v2 and CAMI2","title":"Features"},{"location":"#installation-with-conda","text":"The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are 
possible performance benefits to compiling ganon from source on the target machine rather than using the conda version. To do so, please follow the instructions below:","title":"Installation with conda"},{"location":"#installation-from-source","text":"","title":"Installation from source"},{"location":"#python-dependencies","text":"python >=3.6, pandas >=1.2.0, multitax >=1.3.1, genome_updater >=0.6.3 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.2.0\" \"multitax>=1.3.1\" wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh && chmod +x genome_updater.sh # Conda/Mamba (alternative) conda install -c bioconda -c conda-forge \"pandas>=1.2.0\" \"multitax>=1.3.1\" \"genome_updater>=0.6.3\"","title":"Python dependencies"},{"location":"#c-dependencies","text":"GCC >=11, CMake >=3.4, zlib, bzip2, raptor ==3.0.1. Tip: If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda. In CMake, you may have to set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" replacing /path/to/miniconda3 with your local path to the conda installation.","title":"C++ dependencies"},{"location":"#downloading-and-building-ganon-submodules","text":"git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. 
/myprefix/bin/), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/; use -DINCLUDE_DIRS to set alternative paths to the cxxopts and Catch2 libraries; to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON","title":"Downloading and building ganon + submodules"},{"location":"#installing-raptor","text":"The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor=3.0.1\" (already included in ganon install via conda). Note: raptor is required to build databases with the Hierarchical Interleaved Bloom Filter (ganon build --filter-type hibf). To build old-style ganon indices (ganon build --filter-type ibf), raptor is not required. To install raptor from source, follow the instructions below:","title":"Installing raptor"},{"location":"#dependencies","text":"CMake >= 3.18; GCC 11, 12 or 13 (most recent minor version)","title":"Dependencies"},{"location":"#downloading-and-building-raptor-submodules","text":"git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. 
make -j 4 Binaries will be located in the bin directory. You may have to pass the path to the binaries to ganon build with --raptor-path raptor/build/bin","title":"Downloading and building raptor + submodules"},{"location":"#testing","text":"If everything was properly installed, the following command should show the help pages without errors: ganon -h","title":"Testing"},{"location":"#running-tests","text":"python3 -m pip install \"parameterized>=0.9.0\" # Alternative: conda install -c conda-forge \"parameterized>=0.9.0\" python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV .","title":"Running tests"},{"location":"#parameters","text":"usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 2.1.0 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. 
Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. (default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. 
With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter-type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter-type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). 
(default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Tab-separated file with all necessary file/sequence information. Fields: file [ target node specialization specialization name]. For details: https://pirovc.github.io/ganon/custom_databases/. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. Parse input by file or by sequence. Using 'file' is recommended and will speed-up the building process (default: file) -l , --level Max. level to build the database. By default, --level is the --input-target. Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. 
[refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter-type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter-type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. 
(default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. 
percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). 
Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). 
In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] 
Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] 
One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. 
percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] 
One or more taxids to report (including children taxa) (default: [])","title":"Parameters"},{"location":"classification/","text":"Classification :) ganon classify will match single and/or paired-end sets of reads against one or more databases . By default, parameters are optimized for taxonomic profiling , meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep : plain report of the run, used to further generate tree-like reports results.tre : tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report ) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all . More information about output files here . Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application. Profiling :) ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here ) Binning :) To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning , which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more information here ) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 
23 , 27 ) will improve sensitivity of your results at the cost of a larger database and memory usage. Reads with multiple matches :) There are three ways to handle reads with multiple matches in ganon classify : --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca : uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip : will not resolve multi-matching reads Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all . Reports can be generated independently with ganon report using the output file .rep Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped. Classifying more reads :) By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0 , respectively). More details here . Multiple and Hierarchical classification :) ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix . They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. 
To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff ) while others are set for each hierarchical level (e.g. --rel-filter ) Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. `--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz Parameter details :) reads (--single-reads, --paired-reads) :) ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (counting both reads in paired mode). 
cutoff and filter (--rel-cutoff, --rel-filter) :) ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter . Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) that should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 21 (ceiling is applied). 
Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 54 (ceiling is applied). ref1 and ref2 are reported as matches Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match. False positive of a query (--fpr-query) :) ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , ensuring that the false positive rate is under control. In this case however, sensitivity of results may decrease. 
Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"Classification (ganon classify)"},{"location":"classification/#classification","text":"ganon classify will match single and/or paired-end sets of reads against one or more databases . By default, parameters are optimized for taxonomic profiling , meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep : plain report of the run, used to further generate tree-like reports results.tre : tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report ) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all . More information about output files here . Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application.","title":"Classification"},{"location":"classification/#profiling","text":"ganon classify is set up by default to perform taxonomic profiling. 
It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here )","title":"Profiling"},{"location":"classification/#binning","text":"To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning , which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more information here ) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23 , 27 ) will improve sensitivity of your results at the cost of a larger database and memory usage.","title":"Binning"},{"location":"classification/#reads-with-multiple-matches","text":"There are three ways to handle reads with multiple matches in ganon classify : --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca : uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip : will not resolve multi-matching reads Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all . Reports can be generated independently with ganon report using the output file .rep Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. 
In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped.","title":"Reads with multiple matches"},{"location":"classification/#classifying-more-reads","text":"By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0 , respectively). More details here .","title":"Classifying more reads"},{"location":"classification/#multiple-and-hierarchical-classification","text":"ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix . They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff ) while others are set for each hierarchical level (e.g. --rel-filter ) Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. 
`--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz","title":"Multiple and Hierarchical classification"},{"location":"classification/#parameter-details","text":"","title":"Parameter details"},{"location":"classification/#reads-single-reads-paired-reads","text":"ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (counting both reads in paired mode).","title":"reads (--single-reads, --paired-reads)"},{"location":"classification/#cutoff-and-filter-rel-cutoff-rel-filter","text":"ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter . Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. 
filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) that should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 21 (ceiling is applied). Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 54 (ceiling is applied). ref1 and ref2 are reported as matches Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. 
A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match.","title":"cutoff and filter (--rel-cutoff, --rel-filter)"},{"location":"classification/#false-positive-of-a-query-fpr-query","text":"ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , ensuring that the false positive rate is under control. In this case however, sensitivity of results may decrease. Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"False positive of a query (--fpr-query)"},{"location":"custom_databases/","text":"Custom databases :) Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. 
The usage of this procedure depends on the configuration of your files: Filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz : genomic fasta files in the NCBI standard, with the assembly accession at the beginning of the filename. Provide the files with the --input parameter. ganon will try to retrieve all necessary information to build the database. Headers like >NC_006297.1 Bacteroides fragilis YCH46 ... : sequence headers are in the NCBI standard, with the sequence accession right after > , followed by a space (or line break). Provide the files with the --input parameter and set --input-target sequence . ganon will try to retrieve all necessary information to build the database. For non-standard filenames and headers, follow this Warning --input-target sequence will be slower to build and will use more disk space, since files have to be re-written separately for each sequence. More information about building by file or sequence can be found here . The --level is an important parameter that will define the (max.) classification level for the database ( more info ): --level file or sequence -> default behavior (depending on --input-target ), use file/sequence as classification target --level assembly -> will retrieve assembly related to the file/sequence, use assembly as classification target --level leaves or species , genus ,... -> group input by taxonomy, use tax. nodes at the rank chosen as classification target More info about other parameters here . Non-standard files/headers with --input-file :) As an alternative to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. file : relative or full path to the sequence file target : any unique text to name the file, to be used in the taxonomy node : taxonomic node (e.g. 
taxid) to link entry with taxonomy specialization : creates a specialized taxonomic level with a custom name, allowing files to be grouped specialization_name : a name for the specialization, to be used in the taxonomy Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Below you find example of --input-file . Note they are slightly different depending on the --input-target chosen. They need to be tab-separated to be properly parsed (tsv). Examples of --input-file using the default --input-target file :) List of files :) sequences.fasta others.fasta No taxonomic information is provided so --taxonomy skip should be set. The classification against the generated database will be performed at file level ( --level file ), since that is the only available information given. List of files with alternative names :) sequences.fasta sequences others.fasta others Just like above, but with a specific name to be used for each file. Files and taxonomy :) sequences.fasta sequences 562 others.fasta others 623 The classification max. level against this database will depend on the value set for --level : --level file -> use the file (named with target) with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Files, taxonomy and specialization :) sequences.fasta sequences 562 ID44444 Escherichia coli TW10119 others.fasta others 623 ID55555 Shigella flexneri 1a The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level file -> use the file (named with target) as a tax. node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Examples of --input-file using --input-target sequence :) To provide a tabular information for every sequence in your files, you need to use the target field (2nd col.) 
of the --input-file to input sequence headers. For example: Sequences and taxonomy :) sequences.fasta NZ_CP054001.1 562 sequences.fasta NZ_CP117955.1 623 others.fasta header1 666 others.fasta header2 666 The classification max. level against this database will depend on the value set for --level : --level sequence -> use the sequence header with node as parent --level assembly -> will attempt to retrieve the assembly related to the sequence with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Sequences, taxonomy and specialization :) sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119 sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a others.fasta header1 666 StrainA My Strain others.fasta header2 666 StrainA My Strain The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level sequence -> use the sequence header with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Examples :) Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom : HumGut :) Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available Plasmid, Plastid and Mitochondrion from RefSeq :) Extra repositories from RefSeq release not included as default databases. Website . # Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence UniVec, UniVec_core :) \"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. 
UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link . MGnify genome catalogues (MAGs) :) \"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . 
Currently available genome catalogues (2024-02-09): chicken-gut cow-rumen honeybee-gut human-gut human-oral human-vaginal marine mouse-gut non-model-fish-gut pig-gut zebrafish-fecal List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/ Example on how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v101 --taxonomy gtdb --level leaves --threads 8 Note MGnify genomes catalogues will be build with GTDB taxonomy. Pathogen detection FDA-ARGOS :) A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . # Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files BLAST databases (nt env_nt nt_prok ...) :) BLAST databases. Website / FTP . 
Current available nucleotide databases (2024-02-09): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies . The example shows how to download , parse and build a ganon database from BLAST database files. It does so by splitting the database into taxonomic specific files, to speed-up the build process: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 
sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon with the fields: filepath file taxid blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\".fna\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads ${threads} --level leaves # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences and input_file rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately. Files from genome_updater :) To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32 Parameter details :) False positive and size (--max-fp, --filter-size) :) ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches on classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be much lower, but directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size . 
When using this option, please observe the theoretical false positive of the index reported at the end of the building process. minimizers (--window-size, --kmer-size) :) In ganon build , when --window-size > --kmer-size minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database. Target file or sequence (--input-target) :) This is a parameter that defines how ganon will parse your input files: - --input-target file (default) will consider every file provided with --input a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored). - --input-target sequence will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input into a separate file. This will take longer and use more disk space. --input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired. Build level (--level) :) The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. In ganon build the default value is species . In ganon build-custom the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves or --level species (or genus , family , ...) 
will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization (4th col.) defined in the --input-file . Genome sizes (--genome-size-files) :) ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database . Retrieving info (--ncbi-sequence-info, --ncbi-file-info) :) Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow this step to be customized. With --input-target sequence , the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. With --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. 
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Custom databases (ganon build-custom)"},{"location":"custom_databases/#custom-databases","text":"Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. The usage of this procedure depends on the configuration of your files: Filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz : genomic fasta files in the NCBI standard, with the assembly accession at the beginning of the filename. Provide the files with the --input parameter. ganon will try to retrieve all necessary information to build the database. Headers like >NC_006297.1 Bacteroides fragilis YCH46 ... : sequence headers are in the NCBI standard, with the sequence accession right after > , followed by a space (or line break). Provide the files with the --input parameter and set --input-target sequence . ganon will try to retrieve all necessary information to build the database. For non-standard filenames and headers, follow this Warning --input-target sequence will be slower to build and will use more disk space, since files have to be re-written separately for each sequence. More information about building by file or sequence can be found here . The --level is an important parameter that will define the (max.) classification level for the database ( more info ): --level file or sequence -> default behavior (depending on --input-target ), use file/sequence as classification target --level assembly -> will retrieve assembly related to the file/sequence, use assembly as classification target --level leaves or species , genus ,... -> group input by taxonomy, use tax. 
nodes at the rank chosen as classification target More info about other parameters here .","title":"Custom databases"},{"location":"custom_databases/#non-standard-filesheaders-with-input-file","text":"As an alternative to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. file : relative or full path to the sequence file target : any unique text to name the file, to be used in the taxonomy node : taxonomic node (e.g. taxid) to link entry with taxonomy specialization : creates a specialized taxonomic level with a custom name, allowing files to be grouped specialization_name : a name for the specialization, to be used in the taxonomy Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Below are examples of --input-file . Note that they differ slightly depending on the --input-target chosen. They need to be tab-separated to be properly parsed (tsv).","title":"Non-standard files/headers with --input-file"},{"location":"custom_databases/#examples-of-input-file-using-the-default-input-target-file","text":"","title":"Examples of --input-file using the default --input-target file"},{"location":"custom_databases/#list-of-files","text":"sequences.fasta others.fasta No taxonomic information is provided so --taxonomy skip should be set. 
The classification against the generated database will be performed at file level ( --level file ), since that is the only available information given.","title":"List of files"},{"location":"custom_databases/#list-of-files-with-alternative-names","text":"sequences.fasta sequences others.fasta others Just like above, but with a specific name to be used for each file.","title":"List of files with alternative names"},{"location":"custom_databases/#files-and-taxonomy","text":"sequences.fasta sequences 562 others.fasta others 623 The classification max. level against this database will depend on the value set for --level : --level file -> use the file (named with target) with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Files and taxonomy"},{"location":"custom_databases/#files-taxonomy-and-specialization","text":"sequences.fasta sequences 562 ID44444 Escherichia coli TW10119 others.fasta others 623 ID55555 Shigella flexneri 1a The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level file -> use the file (named with target) as a tax. node, with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Files, taxonomy and specialization"},{"location":"custom_databases/#examples-of-input-file-using-input-target-sequence","text":"To provide tabular information for every sequence in your files, you need to use the target field (2nd col.) of the --input-file to input sequence headers. For example:","title":"Examples of --input-file using --input-target sequence"},{"location":"custom_databases/#sequences-and-taxonomy","text":"sequences.fasta NZ_CP054001.1 562 sequences.fasta NZ_CP117955.1 623 others.fasta header1 666 others.fasta header2 666 The classification max. 
level against this database will depend on the value set for --level : --level sequence -> use the sequence header with node as parent --level assembly -> will attempt to retrieve the assembly related to the sequence with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Sequences and taxonomy"},{"location":"custom_databases/#sequences-taxonomy-and-specialization","text":"sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119 sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a others.fasta header1 666 StrainA My Strain others.fasta header2 666 StrainA My Strain The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level sequence -> use the sequence header with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Sequences, taxonomy and specialization"},{"location":"custom_databases/#examples","text":"Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom :","title":"Examples"},{"location":"custom_databases/#humgut","text":"Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available","title":"HumGut"},{"location":"custom_databases/#plasmid-plastid-and-mitochondrion-from-refseq","text":"Extra repositories from RefSeq release not included as default databases. Website . 
# Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence","title":"Plasmid, Plastid and Mitochondrion from RefSeq"},{"location":"custom_databases/#univec-univec_core","text":"\"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link .","title":"UniVec, UniVec_core"},{"location":"custom_databases/#mgnify-genome-catalogues-mags","text":"\"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". 
Article / Website / FTP . Currently available genome catalogues (2024-02-09): chicken-gut cow-rumen honeybee-gut human-gut human-oral human-vaginal marine mouse-gut non-model-fish-gut pig-gut zebrafish-fecal List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/ Example on how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v101 --taxonomy gtdb --level leaves --threads 8 Note MGnify genomes catalogues will be build with GTDB taxonomy.","title":"MGnify genome catalogues (MAGs)"},{"location":"custom_databases/#pathogen-detection-fda-argos","text":"A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . 
# Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files","title":"Pathogen detection FDA-ARGOS"},{"location":"custom_databases/#blast-databases-nt-env_nt-nt_prok","text":"BLAST databases. Website / FTP . Current available nucleotide databases (2024-02-09): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies . The example shows how to download , parse and build a ganon database from BLAST database files. 
It does so by splitting the database into taxonomic specific files, to speed-up the build process: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon with the fields: filepath file taxid blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\".fna\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads ${threads} --level leaves # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences and input_file rm -rf \"${db}\" 
\"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately.","title":"BLAST databases (nt env_nt nt_prok ...)"},{"location":"custom_databases/#files-from-genome_updater","text":"To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32","title":"Files from genome_updater"},{"location":"custom_databases/#parameter-details","text":"","title":"Parameter details"},{"location":"custom_databases/#false-positive-and-size-max-fp-filter-size","text":"ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches on classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so that any target (defined by --level ) has a 1 in 100 chance of reporting a false k-mer match. The false positive rate of a query (all k-mers of a read) will be much lower, but is directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive rate of the index reported at the end of the building process.","title":"False positive and size (--max-fp, --filter-size)"},{"location":"custom_databases/#minimizers-window-size-kmer-size","text":"In ganon build , when --window-size > --kmer-size , minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. 
If --window-size = --kmer-size , all k-mers are going to be used to build the database.","title":"minimizers (--window-size, --kmer-size)"},{"location":"custom_databases/#target-file-or-sequence-input-target","text":"This is a parameter that defines how ganon will parse your input files: - --input-target file (default) will consider every file provided with --input a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored). - --input-target sequence will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input into a separated file. This will take longer and use more disk space. --input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired.","title":"Target file or sequence (--input-target)"},{"location":"custom_databases/#build-level-level","text":"The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. In ganon build the default value is species . In ganon build-custom the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves or --level species (or genus , family , ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use specialization (4th col.) 
defined in the --input-file .","title":"Build level (--level)"},{"location":"custom_databases/#genome-sizes-genome-size-files","text":"Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database .","title":"Genome sizes (--genome-size-files)"},{"location":"custom_databases/#retrieving-info-ncbi-sequence-info-ncbi-file-info","text":"Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customization of this step. When --input-target sequence is used, the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. When --input-target file is used, --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. 
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Retrieving info (--ncbi-sequence-info, --ncbi-file-info)"},{"location":"default_databases/","text":"Databases :) ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data. RefSeq and GenBank :) NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results. 
Commonly used sub-sets :) RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers will certainly have grown. The commands provided will download up-to-date assemblies and will require slightly more resources. 
GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * Size in GB. ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here . Specific organisms or taxonomic groups :) It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq. 
More filter options :) ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus dated before 2023-01-01. For more information about genome_updater parameters, please check the repository . GTDB :) By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12 Update (ganon update) :) Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter. Reproducibility :) If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. 
To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12 Reducing database size :) Filter type (IBF and HIBF) :) The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF False positive rate :) A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. These can be filtered with the --fpr-query parameter in ganon classify . k-mer and window size :) Define how much unique information is stored in the database. More details The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy. Top assemblies :) RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. 
This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analysis are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and use the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus Split databases :) Ganon allows classification with multiple databases in one level or in an hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ). Fixed size and Mode (only for --filter-type ibf) :) A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refer to the false positive rate and not to the database size (which is fixed). Example :) Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configurations may differ considerably.","title":"Databases (ganon build)"},{"location":"default_databases/#databases","text":"ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will 
download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data.","title":"Databases"},{"location":"default_databases/#refseq-and-genbank","text":"NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.","title":"RefSeq and GenBank"},{"location":"default_databases/#commonly-used-sub-sets","text":"RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species 
Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers will certainly have grown. The commands provided will download up-to-date assemblies and will require slightly more resources. GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * Size in GB. ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . 
Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here .","title":"Commonly used sub-sets"},{"location":"default_databases/#specific-organisms-or-taxonomic-groups","text":"It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq.","title":"Specific organisms or taxonomic groups"},{"location":"default_databases/#more-filter-options","text":"ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus dated before 2023-01-01. For more information about genome_updater parameters, please check the repository .","title":"More filter options"},{"location":"default_databases/#gtdb","text":"By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12","title":"GTDB"},{"location":"default_databases/#update-ganon-update","text":"Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. 
For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter.","title":"Update (ganon update)"},{"location":"default_databases/#reproducibility","text":"If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12","title":"Reproducibility"},{"location":"default_databases/#reducing-database-size","text":"","title":"Reducing database size"},{"location":"default_databases/#filter-type-ibf-and-hibf","text":"The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). 
Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF","title":"Filter type (IBF and HIBF)"},{"location":"default_databases/#false-positive-rate","text":"A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. These can be filtered with the --fpr-query parameter in ganon classify .","title":"False positive rate"},{"location":"default_databases/#k-mer-and-window-size","text":"Define how much unique information is stored in the database. More details The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy.","title":"k-mer and window size"},{"location":"default_databases/#top-assemblies","text":"RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. This can not only bias analyses but also bring redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and using the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) 
ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus","title":"Top assemblies"},{"location":"default_databases/#split-databases","text":"Ganon allows classification with multiple databases in one level or in an hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ).","title":"Split databases"},{"location":"default_databases/#fixed-size-and-mode-only-for-filter-type-ibf","text":"A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the false positive chances on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refers to the false positive and not to the database size (which is fixed).","title":"Fixed size and Mode (only for --filter-type ibf)"},{"location":"default_databases/#example","text":"Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configuration may be quite different","title":"Example"},{"location":"outputfiles/","text":"Output files :) ganon build/build-custom/update :) Every run on ganon build , ganon build-custom or ganon update will generate the following database files: 
{prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequences and auxiliary files. Not necessary for classification. Keep this folder if the database will be updated later. Otherwise it can be deleted. Warning Database files generated with version 1.2.0 or higher are not compatible with older versions. ganon classify :) {prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . The last 2 lines of the file report #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after the EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active. Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count) ganon report :) {prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. 
reads : sequence abundances , reports the proportion of sequences assigned to a taxa, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to their original target, including multiple and shared matches. Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. 
Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep . ganon table :) {output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"Output files"},{"location":"outputfiles/#output-files","text":"","title":"Output files"},{"location":"outputfiles/#ganon-buildbuild-customupdate","text":"Every run on ganon build , ganon build-custom or ganon update will generate the following database files: {prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be update later. Otherwise it can be deleted. 
Warning Database files generated with version 1.2.0 or higher are not compatible with older versions.","title":"ganon build/build-custom/update"},{"location":"outputfiles/#ganon-classify","text":"{prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . At the end prints 2 extra lines with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count)","title":"ganon classify"},{"location":"outputfiles/#ganon-report","text":"{prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. reads : sequence abundances , reports the proportion of sequences assigned to a taxon, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to their original target, including multiple and shared matches. 
Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. 
To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep .","title":"ganon report"},{"location":"outputfiles/#ganon-table","text":"{output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"ganon table"},{"location":"reports/","text":"Reports :) ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches. 
Examples :) Given the output .rep from ganon classify and the database used ( --db-prefix ): Taxonomic profile with abundance estimation (default) :) ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance Sequence profile :) ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads Matches profile :) ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches Filtering results :) ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant. Parameter details :) report type (--report-type) :) Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. The re-distribution applies to reads classified with an LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same as reads with read count re-distribution corr is the same as reads with correction by genome size matches will report the total number of matches classified, either unique or shared. 
This option will output the total number of matches instead of the total number of reads","title":"Reports (ganon report)"},{"location":"reports/#reports","text":"ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches.","title":"Reports"},{"location":"reports/#examples","text":"Given the output .rep from ganon classify and the database used ( --db-prefix ):","title":"Examples"},{"location":"reports/#taxonomic-profile-with-abundance-estimation-default","text":"ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance","title":"Taxonomic profile with abundance estimation (default)"},{"location":"reports/#sequence-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads","title":"Sequence profile"},{"location":"reports/#matches-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches","title":"Matches profile"},{"location":"reports/#filtering-results","text":"ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant.","title":"Filtering results"},{"location":"reports/#parameter-details","text":"","title":"Parameter details"},{"location":"reports/#report-type-report-type","text":"Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. 
The re-distribution applies to reads classified with an LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same as reads with read count re-distribution corr is the same as reads with correction by genome size matches will report the total number of matches classified, either unique or shared. This option will output the total number of matches instead of the total number of reads","title":"report type (--report-type)"},{"location":"start/","text":"Quick Start Guide :) Install :) conda install -c bioconda -c conda-forge ganon Download and Build a database :) Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above. Classify and generate a tax. profile :) Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile Important parameters :) The most important parameters and trade-offs to be aware of when using ganon: ganon build :) --level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive of the bloom filters. 
The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, especially read binning, at a cost of larger databases. ganon classify :) --rel-cutoff : defines the min. percentage of k-mers shared with a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> fewer read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm. ganon report :) --report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead of a hard cutoff. 
The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. If you are not sure which values to use or see something unexpected, please open an issue .","title":"Quick Start"},{"location":"start/#quick-start-guide","text":"","title":"Quick Start Guide"},{"location":"start/#install","text":"conda install -c bioconda -c conda-forge ganon","title":"Install"},{"location":"start/#download-and-build-a-database","text":"Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above.","title":"Download and Build a database"},{"location":"start/#classify-and-generate-a-tax-profile","text":"Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile","title":"Classify and generate a tax. profile"},{"location":"start/#important-parameters","text":"The most important parameters and trade-offs to be aware of when using ganon:","title":"Important parameters"},{"location":"start/#ganon-build","text":"--level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive of the bloom filters. The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. 
However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, especially read binning, at a cost of larger databases.","title":"ganon build"},{"location":"start/#ganon-classify","text":"--rel-cutoff : defines the min. percentage of k-mers shared with a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> fewer read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm.","title":"ganon classify"},{"location":"start/#ganon-report","text":"--report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead of a hard cutoff. The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. 
If you are not sure which values to use or see something unexpected, please open an issue .","title":"ganon report"},{"location":"table/","text":"Table :) ganon table filters and summarizes several reports obtained with ganon report into a table. Filters for each sample or for averages among all samples can also be applied. Examples :) Given several .tre from ganon report : Counts of species :) ganon table --input *.tre --output-file table.tsv --rank species Abundance of species :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species Top 10 species (among all samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10 Top 10 species (from each samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10 Filtering results :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Table (ganon table)"},{"location":"table/#table","text":"ganon table filters and summarizes several reports obtained with ganon report into a table. 
Filters for each sample or for averages among all samples can also be applied.","title":"Table"},{"location":"table/#examples","text":"Given several .tre from ganon report :","title":"Examples"},{"location":"table/#counts-of-species","text":"ganon table --input *.tre --output-file table.tsv --rank species","title":"Counts of species"},{"location":"table/#abundance-of-species","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species","title":"Abundance of species"},{"location":"table/#top-10-species-among-all-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10","title":"Top 10 species (among all samples)"},{"location":"table/#top-10-species-from-each-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10","title":"Top 10 species (from each samples)"},{"location":"table/#filtering-results","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Filtering results"},{"location":"tutorials/","text":"Tutorials :) ... soon ...","title":"Tutorials"},{"location":"tutorials/#tutorials","text":"... soon ...","title":"Tutorials"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index cb272fa6..63ea9dbf 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ
python3 -m unittest discover -s tests/ganon/integration/
+python3 -m pip install "parameterized>=0.9.0" # Alternative: conda install -c conda-forge "parameterized>=0.9.0"
+python3 -m unittest discover -s tests/ganon/integration/
python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files
cd build/
ctest -VV .
@@ -246,7 +250,7 @@ ParametersParametersParameters=3.6 pandas >=1.1.0 multitax >=1.3.1 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.1.0\" \"multitax>=1.3.1\" # Conda (alternative) conda install \"pandas>=1.1.0\" \"multitax>=1.3.1\" C++ dependencies :) GCC >=11 CMake >=3.4 zlib bzip2 raptor >=3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" changing /path/to/miniconda3 with your local path to the conda installation. Downloading and building ganon + submodules :) git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON Installing raptor :) The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor>=3.0.1\" (already included in ganon install via conda). 
Note raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ) To build old style ganon indices ganon build --filter-type ibf , raptor is not required To install raptor from source, follow the instructions below: Dependencies :) CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version) Downloading and building raptor + submodules :) git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory you may have to inform ganon build the path to the binaries with --raptor-path raptor/build/bin Testing :) If everything was properly installed, the following command should show the help pages without errors: ganon -h Running tests :) python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV . Parameters :) usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 2.0.1 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. 
ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. (default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. 
(default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. 
(default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Manually set information for input files: file [target node specialization specialization name]. target is the sequence identifier if --input-target sequence (file can be repeated for multiple sequences). if --input-target file and target is not set, filename is used. node is the taxonomic identifier. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. By default: 'file' if multiple input files are provided or --input-file is set, 'sequence' if a single file is provided. Using 'file' is recommended and will speed-up the building process (default: None) -l , --level Use a specialized target to build the database. By default, --level is the --input-target. Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] 
Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. [refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. 
By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] 
[-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. 
More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). 
The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. 
In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. 
Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). 
Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] 
Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: [])","title":"ganon2"},{"location":"#ganon","text":"Code: GitHub repository ganon2 pre-print ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was mainly developed for, but is not limited to, the metagenomics classification problem: quickly assigning sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported.","title":"ganon"},{"location":"#features","text":"integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates; NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all; customizable database build for local or non-standard sequence files; optimized taxonomic binning and profiling configurations; build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization; hierarchical classification using several databases in one or more levels in just one run; EM and/or LCA algorithms to solve multiple-matching reads; reporting of multiple and unique matches for every read; reporting of sequence, taxonomic or multi-match abundances with optional genome size correction; advanced tree-like reports with several filter options; generation of contingency tables with several filters for multi-sample studies. ganon achieved very good results in our own evaluations and in independent evaluations: LEMMI, LEMMI v2 and CAMI2","title":"Features"},{"location":"#installation-with-conda","text":"The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are 
possible performance benefits to compiling ganon from source on the target machine rather than using the conda version. To do so, please follow the instructions below:","title":"Installation with conda"},{"location":"#installation-from-source","text":"","title":"Installation from source"},{"location":"#python-dependencies","text":"python >=3.6 pandas >=1.1.0 multitax >=1.3.1 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.1.0\" \"multitax>=1.3.1\" # Conda (alternative) conda install \"pandas>=1.1.0\" \"multitax>=1.3.1\"","title":"Python dependencies"},{"location":"#c-dependencies","text":"GCC >=11 CMake >=3.4 zlib bzip2 raptor >=3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda. In CMake, you may have to set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" replacing /path/to/miniconda3 with your local path to the conda installation.","title":"C++ dependencies"},{"location":"#downloading-and-building-ganon-submodules","text":"git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. 
to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON","title":"Downloading and building ganon + submodules"},{"location":"#installing-raptor","text":"The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor>=3.0.1\" (already included in ganon install via conda). Note: raptor is required to build databases with the Hierarchical Interleaved Bloom Filter (ganon build --filter-type hibf). To build old-style ganon indices (ganon build --filter-type ibf), raptor is not required. To install raptor from source, follow the instructions below:","title":"Installing raptor"},{"location":"#dependencies","text":"CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version)","title":"Dependencies"},{"location":"#downloading-and-building-raptor-submodules","text":"git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory; you may have to tell ganon build where the binaries are with --raptor-path raptor/build/bin","title":"Downloading and building raptor + submodules"},{"location":"#testing","text":"If everything was properly installed, the following command should show the help pages without errors: ganon -h","title":"Testing"},{"location":"#running-tests","text":"python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV .","title":"Running tests"},{"location":"#parameters","text":"usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 
2.0.1 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. 
(default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. 
(default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Manually set information for input files: file [target node specialization specialization name]. target is the sequence identifier if --input-target sequence (file can be repeated for multiple sequences). if --input-target file and target is not set, filename is used. node is the taxonomic identifier. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. 
By default: 'file' if multiple input files are provided or --input-file is set, 'sequence' if a single file is provided. Using 'file' is recommended and will speed-up the building process (default: None) -l , --level Use a specialized target to build the database. By default, --level is the --input-target. Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. [refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. 
Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. 
(default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] 
Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). 
Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] 
[-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. 
abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] 
Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. 
May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: [])","title":"Parameters"},{"location":"classification/","text":"Classification :) ganon classify will match single and/or paired-end sets of reads against one or more databases. By default, parameters are optimized for taxonomic profiling, meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep: plain report of the run, used to further generate tree-like reports results.tre: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all. More information about output files here. Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application.
Profiling :) ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low abundant taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more info here) Binning :) To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning, which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more info here) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23, 27) will improve sensitivity of your results at the cost of a larger database and memory usage. Reads with multiple matches :) There are three ways to handle reads with multiple matches in ganon classify: --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca: uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip: will not resolve multi-matching reads. Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all. Reports can be generated independently with ganon report using the output file .rep. Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank.
In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped. Classifying more reads :) By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0, respectively). More details here. Multiple and Hierarchical classification :) ganon classify can be run against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix. They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff) while others are set for each hierarchical level (e.g. --rel-filter). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3.
`--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz Parameter details :) reads (--single-reads, --paired-reads) :) ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode). cutoff and filter (--rel-cutoff, --rel-filter) :) ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter. Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) that should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads.
cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references (ref1..5), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25, the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 20.5, rounded up to 21 (ceiling is applied). Next, with --rel-filter 0.5, the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 53.5, rounded up to 54 (ceiling is applied). ref1 and ref2 are reported as matches. Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match. False positive of a query (--fpr-query) :) ganon uses Bloom Filters, probabilistic data structures that may return false positive results.
The base false positive of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter. Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp, assuring that the false positive is under control. In this case however, sensitivity of results may decrease. Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"Classification (ganon classify)"},{"location":"classification/#classification","text":"ganon classify will match single and/or paired-end sets of reads against one or more databases. By default, parameters are optimized for taxonomic profiling, meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep: plain report of the run, used to further generate tree-like reports results.tre: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all. More information about output files here.
Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application.","title":"Classification"},{"location":"classification/#profiling","text":"ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low abundant taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more info here)","title":"Profiling"},{"location":"classification/#binning","text":"To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning, which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more info here) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23, 27) will improve sensitivity of your results at the cost of a larger database and memory usage.","title":"Binning"},{"location":"classification/#reads-with-multiple-matches","text":"There are three ways to handle reads with multiple matches in ganon classify: --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca: uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip: will not resolve multi-matching reads. Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all.
Reports can be generated independently with ganon report using the output file .rep. Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped.","title":"Reads with multiple matches"},{"location":"classification/#classifying-more-reads","text":"By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0, respectively). More details here.","title":"Classifying more reads"},{"location":"classification/#multiple-and-hierarchical-classification","text":"ganon classify can be run against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix. They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff) while others are set for each hierarchical level (e.g.
--rel-filter). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. `--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz","title":"Multiple and Hierarchical classification"},{"location":"classification/#parameter-details","text":"","title":"Parameter details"},{"location":"classification/#reads-single-reads-paired-reads","text":"ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode).","title":"reads (--single-reads, --paired-reads)"},{"location":"classification/#cutoff-and-filter-rel-cutoff-rel-filter","text":"ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter. Every read can be classified against none, one or more references.
ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> less read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 21 (ceiling is applied). 
Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 53.5, rounded up to 54 (ceiling is applied). ref1 and ref2 are reported as matches. Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match.","title":"cutoff and filter (--rel-cutoff, --rel-filter)"},{"location":"classification/#false-positive-of-a-query-fpr-query","text":"ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a much lower false positive. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , assuring that the false positive is under control. 
In this case, however, sensitivity of results may decrease. Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"False positive of a query (--fpr-query)"},{"location":"custom_databases/","text":"Custom databases :) Default NCBI assembly or sequence accession :) Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. To use custom sequences, just provide them with --input . ganon will try to retrieve all information necessary to build a database. Note ganon expects assembly accessions in the filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz . When using --input-target sequence filenames are not important but sequence headers should contain sequence accessions like >CP022124.1 Fusobacterium nu... . More information about building by file or sequence can be found here . Non-standard/custom accessions :) It is also possible to use non-standard accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. Note that file is mandatory and the additional fields are optional. Tip If you just want to build a database without any taxonomic or target information, just send the files with --input , use --taxonomy skip and choose between --input-target file or sequence . 
Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Examples of --input-file With --input-target file (default), where my_target_1 and my_target_2 are just names to assign sequences from (unique) sequence files: sequences.fasta my_target_1 others.fasta my_target_2 With --input-target sequence, second column should match sequence headers on provided sequence files (that should be repeated for each header): sequences.fasta HEADER1 sequences.fasta HEADER2 sequences.fasta HEADER3 others.fasta HEADER4 others.fasta HEADER5 A third column with taxonomic nodes can be provided to link the data with taxonomy. For example with --taxonomy ncbi: sequences.fasta FILE_A 562 others.fasta FILE_B 623 sequences.fasta HEADER1 562 sequences.fasta HEADER2 562 sequences.fasta HEADER3 562 others.fasta HEADER4 623 others.fasta HEADER5 623 Further specializations can be used to create a additional classification level after the taxonomic leaves. For example (using --level custom): sequences.fasta FILE_A 562 ID44444 Escherichia coli TW10119 others.fasta FILE_B 623 ID55555 Shigella flexneri 1a sequences.fasta HEADER1 562 ID443 Escherichia coli TW10119 sequences.fasta HEADER2 562 ID297 Escherichia coli PCN079 sequences.fasta HEADER3 562 ID8873 Escherichia coli P0301867.7 others.fasta HEADER4 623 ID2241 Shigella flexneri 1a others.fasta HEADER5 623 ID4422 Shigella flexneri 1b Examples :) Some examples with download and build commands for custom ganon databases from useful and commonly used repositories and datasets for metagenomics analysis: HumGut :) Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available Plasmid, Plastid and Mitochondrion from RefSeq :) Extra repositories from RefSeq release not included as default databases. Website . 
# Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" # Split sequences in files and retrieve taxonomy mkdir sequences/ zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ \">\" {accver=(substr($1,2)); print accver}{print $0 > \"sequences/\"accver\".fna\"}' | ganon-get-seq-info.sh -e -i - | awk '{print \"sequences/\"$1\".fna\\t\"$1\"\\t\"$3}' > ppm.tsv # Build ganon database ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 16 # OPTIONAL Remove temporary folder and downloaded files rm -rf sequences/ ppm.tsv plasmid.* plastid.* mitochondrion.* UniVec, UniVec_core :) \"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_Core is a subset of UniVec selected to reduce the false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). 
Some are completely artificial but others may be derived from real biological sources. More information in this link . MGnify genome catalogues (MAGs) :) \"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . There are currently (2023-05-04) 8 genome catalogues available: chicken-gut, human-gut, human-oral, marine, non-model-fish-gut, pig-gut and zebrafish-fecal. An example below how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --threads 32 Note MGnify genomes catalogues will be build with GTDB taxonomy. Pathogen detection FDA-ARGOS :) A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . 
# Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files BLAST databases (nt env_nt nt_prok ...) :) BLAST databases. Website / FTP . Current available nucleotide databases (2023-05-04): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt Note List currently available nucleotide databases curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. 
You may need to use some reduction strategies The example below extracts sequences and information from a BLAST db to build a ganon database: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\"\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads 12 # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a 
command from BLAST+ software suite (tested version 2.14.0) and should be installed separately. Files from genome_updater :) To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32 Parameter details :) False positive and size (--max-fp, --filter-size) :) ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches on classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be much lower, but directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive of the index reported at the end of the building process. minimizers (--window-size, --kmer-size) :) In ganon build , when --window-size > --kmer-size minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database. Target file or sequence (--input-target) :) Customized builds can be done either by file or sequence. --input-target file will consider every file provided with --input a single unit. --input-target sequence will use every sequence as a unit. 
--input-target file is the default behavior and most efficient way to build databases. --input-target sequence should only be used when the input sequences are stored in a single file or when classification at sequence level is desired. Build level (--level) :) The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. By default, the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves or --level species (or genus, family, ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization level defined in the --input-file . Genome sizes (--genome-size-files) :) Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database . Retrieving info (--ncbi-sequence-info, --ncbi-file-info) :) Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customizations on this step. 
When --input-target sequence , --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. When --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Custom databases (ganon build-custom)"},{"location":"custom_databases/#custom-databases","text":"","title":"Custom databases"},{"location":"custom_databases/#default-ncbi-assembly-or-sequence-accession","text":"Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. To use custom sequences, just provide them with --input . ganon will try to retrieve all information necessary to build a database. Note ganon expects assembly accessions in the filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz . When using --input-target sequence filenames are not important but sequence headers should contain sequence accessions like >CP022124.1 Fusobacterium nu... . 
More information about building by file or sequence can be found here .","title":"Default NCBI assembly or sequence accession"},{"location":"custom_databases/#non-standardcustom-accessions","text":"It is also possible to use non-standard accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. Note that file is mandatory and additional fields not. Tip If you just want to build a database without any taxonomic or target information, just sent the files with --input , use --taxonomy skip and choose between --input-target file or sequence . Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Examples of --input-file With --input-target file (default), where my_target_1 and my_target_2 are just names to assign sequences from (unique) sequence files: sequences.fasta my_target_1 others.fasta my_target_2 With --input-target sequence, second column should match sequence headers on provided sequence files (that should be repeated for each header): sequences.fasta HEADER1 sequences.fasta HEADER2 sequences.fasta HEADER3 others.fasta HEADER4 others.fasta HEADER5 A third column with taxonomic nodes can be provided to link the data with taxonomy. For example with --taxonomy ncbi: sequences.fasta FILE_A 562 others.fasta FILE_B 623 sequences.fasta HEADER1 562 sequences.fasta HEADER2 562 sequences.fasta HEADER3 562 others.fasta HEADER4 623 others.fasta HEADER5 623 Further specializations can be used to create a additional classification level after the taxonomic leaves. 
For example (using --level custom): sequences.fasta FILE_A 562 ID44444 Escherichia coli TW10119 others.fasta FILE_B 623 ID55555 Shigella flexneri 1a sequences.fasta HEADER1 562 ID443 Escherichia coli TW10119 sequences.fasta HEADER2 562 ID297 Escherichia coli PCN079 sequences.fasta HEADER3 562 ID8873 Escherichia coli P0301867.7 others.fasta HEADER4 623 ID2241 Shigella flexneri 1a others.fasta HEADER5 623 ID4422 Shigella flexneri 1b","title":"Non-standard/custom accessions"},{"location":"custom_databases/#examples","text":"Some examples with download and build commands for custom ganon databases from useful and commonly used repositories and datasets for metagenomics analysis:","title":"Examples"},{"location":"custom_databases/#humgut","text":"Collection of >30000 genomes from healthy human metagenomes. Article / Website . # Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are 
available","title":"HumGut"},{"location":"custom_databases/#plasmid-plastid-and-mitochondrion-from-refseq","text":"Extra repositories from RefSeq release not included as default databases. Website . # Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" # Split sequences in files and retrieve taxonomy mkdir sequences/ zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ \">\" {accver=(substr($1,2)); print accver}{print $0 > \"sequences/\"accver\".fna\"}' | ganon-get-seq-info.sh -e -i - | awk '{print \"sequences/\"$1\".fna\\t\"$1\"\\t\"$3}' > ppm.tsv # Build ganon database ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 16 # OPTIONAL Remove temporary folder and downloaded files rm -rf sequences/ ppm.tsv plasmid.* plastid.* mitochondrion.*","title":"Plasmid, Plastid and Mitochondrion from RefSeq"},{"location":"custom_databases/#univec-univec_core","text":"\"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. 
# UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link .","title":"UniVec, UniVec_core"},{"location":"custom_databases/#mgnify-genome-catalogues-mags","text":"\"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . There are currently (2023-05-04) 8 genome catalogues available: chicken-gut, human-gut, human-oral, marine, non-model-fish-gut, pig-gut and zebrafish-fecal. 
An example below how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --threads 32 Note MGnify genomes catalogues will be build with GTDB taxonomy.","title":"MGnify genome catalogues (MAGs)"},{"location":"custom_databases/#pathogen-detection-fda-argos","text":"A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . # Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files","title":"Pathogen detection FDA-ARGOS"},{"location":"custom_databases/#blast-databases-nt-env_nt-nt_prok","text":"BLAST databases. Website / FTP . 
Current available nucleotide databases (2023-05-04): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt Note List currently available nucleotide databases curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies The example below extracts sequences and information from a BLAST db to build a ganon database: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences 
from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\"\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads 12 # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from BLAST+ software suite (tested version 2.14.0) and should be installed separately.","title":"BLAST databases (nt env_nt nt_prok ...)"},{"location":"custom_databases/#files-from-genome_updater","text":"To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32","title":"Files from genome_updater"},{"location":"custom_databases/#parameter-details","text":"","title":"Parameter details"},{"location":"custom_databases/#false-positive-and-size-max-fp-filter-size","text":"ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches on classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive of the query (all k-mers of a read) will be much lower, but directly affected by this value. 
Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive rate of the index reported at the end of the building process.","title":"False positive and size (--max-fp, --filter-size)"},{"location":"custom_databases/#minimizers-window-size-kmer-size","text":"In ganon build , when --window-size > --kmer-size , minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database.","title":"minimizers (--window-size, --kmer-size)"},{"location":"custom_databases/#target-file-or-sequence-input-target","text":"Customized builds can be done either by file or sequence. --input-target file will consider every file provided with --input a single unit. --input-target sequence will use every sequence as a unit. --input-target file is the default behavior and the most efficient way to build databases. --input-target sequence should only be used when the input sequences are stored in a single file or when classification at sequence level is desired.","title":"Target file or sequence (--input-target)"},{"location":"custom_databases/#build-level-level","text":"The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. By default, the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI servers. --level leaves or --level species (or genus, family, ...) 
will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use the specialization level defined in the --input-file .","title":"Build level (--level)"},{"location":"custom_databases/#genome-sizes-genome-size-files","text":"Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective children nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database .","title":"Genome sizes (--genome-size-files)"},{"location":"custom_databases/#retrieving-info-ncbi-sequence-info-ncbi-file-info","text":"Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customization of this step. When --input-target sequence , the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. When --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. 
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Retrieving info (--ncbi-sequence-info, --ncbi-file-info)"},{"location":"default_databases/","text":"Databases :) ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genome repositories with the ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data. RefSeq and GenBank :) NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results. 
Commonly used sub-sets :) RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown a bit. The commands provided will download up-to-date assemblies and will require slightly more resources. 
GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * in GB -> ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that use a fixed size/amount of RAM. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here . Specific organisms or taxonomic groups :) It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq. 
More filter options :) ganon uses genome_updater to manage downloads. Further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository . GTDB :) By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12 Update (ganon update) :) Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter. Reproducibility :) If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. 
To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12 Reducing database size :) Filter type (IBF and HIBF) :) The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF False positive rate :) A higher --max-fp value will generate a smaller database but with a higher number of false positive matches in classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. These can be filtered with the --fpr-query parameter in ganon classify . k-mer and window size :) Define how much unique information is stored in the database. More details . The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy. Top assemblies :) RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. 
This not only can bias the analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and the full set of assemblies for specific clade analyses. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus Split databases :) Ganon allows classification with multiple databases in one level or in a hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ). Fixed size and Mode (only for --filter-type ibf) :) A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the chance of false positives in classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refer to the false positive rate and not to the database size (which is fixed). Example :) Besides the benefits of using HIBF and specific sub-sets of big repositories shown in the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configurations may differ considerably","title":"Databases (ganon build)"},{"location":"default_databases/#databases","text":"ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genome repositories with the ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will 
download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data.","title":"Databases"},{"location":"default_databases/#refseq-and-genbank","text":"NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.","title":"RefSeq and GenBank"},{"location":"default_databases/#commonly-used-sub-sets","text":"RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species 
Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown a bit. The commands provided will download up-to-date assemblies and will require slightly more resources. GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * in GB -> ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that use a fixed size/amount of RAM. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . 
Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here .","title":"Commonly used sub-sets"},{"location":"default_databases/#specific-organisms-or-taxonomic-groups","text":"It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq.","title":"Specific organisms or taxonomic groups"},{"location":"default_databases/#more-filter-options","text":"ganon uses genome_updater to manage downloads. Further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository .","title":"More filter options"},{"location":"default_databases/#gtdb","text":"By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12","title":"GTDB"},{"location":"default_databases/#update-ganon-update","text":"Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. 
For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter.","title":"Update (ganon update)"},{"location":"default_databases/#reproducibility","text":"If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12","title":"Reproducibility"},{"location":"default_databases/#reducing-database-size","text":"","title":"Reducing database size"},{"location":"default_databases/#filter-type-ibf-and-hibf","text":"The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). 
Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF","title":"Filter type (IBF and HIBF)"},{"location":"default_databases/#false-positive-rate","text":"A higher --max-fp value will generate a smaller database but with a higher number of false positive matches in classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. These can be filtered with the --fpr-query parameter in ganon classify .","title":"False positive rate"},{"location":"default_databases/#k-mer-and-window-size","text":"Define how much unique information is stored in the database. More details . The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy.","title":"k-mer and window size"},{"location":"default_databases/#top-assemblies","text":"RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. This not only can bias the analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and the full set of assemblies for specific clade analyses. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) 
ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus","title":"Top assemblies"},{"location":"default_databases/#split-databases","text":"Ganon allows classification with multiple databases in one level or in a hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ).","title":"Split databases"},{"location":"default_databases/#fixed-size-and-mode-only-for-filter-type-ibf","text":"A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the chance of false positives in classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refer to the false positive rate and not to the database size (which is fixed).","title":"Fixed size and Mode (only for --filter-type ibf)"},{"location":"default_databases/#example","text":"Besides the benefits of using HIBF and specific sub-sets of big repositories shown in the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configurations may differ considerably","title":"Example"},{"location":"outputfiles/","text":"Output files :) ganon build/build-custom/update :) Every run of ganon build , ganon build-custom or ganon update will generate the following database files: 
{prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be updated later. Otherwise it can be deleted. Warning Database files generated with version 1.2.0 or higher are not compatible with older versions. ganon classify :) {prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . At the end, it prints 2 extra lines with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active. Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count) ganon report :) {prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. 
reads : sequence abundances , reports the proportion of sequences assigned to a taxon, each classified read is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to its original target, including multiple and shared matches. Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ). The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is used (keep/skip/split). For all report types except matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. 
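Because the .tre report is tab-separated with the nine fields listed above, it can be filtered with standard tools. The sketch below is only an illustration: it writes a small stand-in file with three lines taken from the example output (the temporary file and the variable names tre_file and species_summary are made up here) and keeps the species-level entries, printing name, cumulative count and cumulative percentage.

```shell
# Stand-in for a {prefix}.tre file (hypothetical temporary file).
# Tab-separated fields: rank, target, lineage, name, unique, shared, children, cumulative, cumulative %.
tre_file=$(mktemp)
printf 'root\t1\t1\troot\t0\t0\t97\t97\t97.97980\n' > "$tre_file"
printf 'genus\t469\t1|2|1224|1236|72274|468|469\tAcinetobacter\t0\t0\t12\t12\t12.12121\n' >> "$tre_file"
printf 'species\t470\t1|2|1224|1236|72274|468|469|470\tAcinetobacter baumannii\t12\t0\t0\t12\t12.12121\n' >> "$tre_file"

# Keep only species-level entries; print name (col 4), cumulative count (col 8) and percentage (col 9).
species_summary=$(awk -F '\t' '$1 == "species" { print $4 "\t" $8 "\t" $9 }' "$tre_file")
echo "$species_summary"
rm -f "$tre_file"
```

The same pattern works for any other rank by changing the value compared against the first column.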
Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep . ganon table :) {output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"Output files"},{"location":"outputfiles/#output-files","text":"","title":"Output files"},{"location":"outputfiles/#ganon-buildbuild-customupdate","text":"Every run on ganon build , ganon build-custom or ganon update will generate the following database files: {prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be update later. Otherwise it can be deleted. 
Warning Database files generated with version 1.2.0 or higher are not compatible with older versions.","title":"ganon build/build-custom/update"},{"location":"outputfiles/#ganon-classify","text":"{prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . At the end prints 2 extra lines with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count)","title":"ganon classify"},{"location":"outputfiles/#ganon-report","text":"{prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-disributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. reads : sequence abundances , reports the proportion of sequences assigned to a taxa, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to their original target, including multiple and shared matches. 
Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. 
To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep .","title":"ganon report"},{"location":"outputfiles/#ganon-table","text":"{output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"ganon table"},{"location":"reports/","text":"Reports :) ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches. 
Examples :) Given the output .rep from ganon classify and the database used ( --db-prefix ): Taxonomic profile with abundance estimation (default) :) ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance Sequence profile :) ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads Matches profile :) ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches Filtering results :) ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant. Parameter details :) report type (--report-type) :) Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. The re-distribution applies for reads classified with a LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same of reads with read count re-distribution corr is the same of reads with correction by genome size matches will report the total number of matches classified, either unique or shared. 
This option will output the total number of matches instead of the total number of reads","title":"Reports (ganon report)"},{"location":"reports/#reports","text":"ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as the total number of matches.","title":"Reports"},{"location":"reports/#examples","text":"Given the output .rep from ganon classify and the database used ( --db-prefix ):","title":"Examples"},{"location":"reports/#taxonomic-profile-with-abundance-estimation-default","text":"ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance","title":"Taxonomic profile with abundance estimation (default)"},{"location":"reports/#sequence-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads","title":"Sequence profile"},{"location":"reports/#matches-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches","title":"Matches profile"},{"location":"reports/#filtering-results","text":"ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant.","title":"Filtering results"},{"location":"reports/#parameter-details","text":"","title":"Parameter details"},{"location":"reports/#report-type-report-type","text":"Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances, which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. 
The re-distribution applies to reads classified with an LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to their original rank. dist is the same as reads with read count re-distribution corr is the same as reads with correction by genome size matches will report the total number of matches classified, either unique or shared. This option will output the total number of matches instead of the total number of reads","title":"report type (--report-type)"},{"location":"start/","text":"Quick Start Guide :) Install :) conda install -c bioconda -c conda-forge ganon Download and Build a database :) Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above. Classify and generate a tax. profile :) Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile Important parameters :) The most important parameters and trade-offs to be aware of when using ganon: ganon build :) --level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain-based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive rate of the bloom filters. 
The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, specially read binning, at a cost of larger databases. ganon classify :) --rel-cutoff : defines the min. percentage of k-mers shared to a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> less read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm. ganon report :) --report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead a hard cutoff. 
The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. If you are not sure which values to use or see something unexpected, please open an issue .","title":"Quick Start"},{"location":"start/#quick-start-guide","text":"","title":"Quick Start Guide"},{"location":"start/#install","text":"conda install -c bioconda -c conda-forge ganon","title":"Install"},{"location":"start/#download-and-build-a-database","text":"Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above.","title":"Download and Build a database"},{"location":"start/#classify-and-generate-a-tax-profile","text":"Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile","title":"Classify and generate a tax. profile"},{"location":"start/#important-parameters","text":"The most important parameters and trade-offs to be aware of when using ganon:","title":"Important parameters"},{"location":"start/#ganon-build","text":"--level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive of the bloom filters. The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. 
However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, specially read binning, at a cost of larger databases.","title":"ganon build"},{"location":"start/#ganon-classify","text":"--rel-cutoff : defines the min. percentage of k-mers shared to a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> less read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm.","title":"ganon classify"},{"location":"start/#ganon-report","text":"--report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead a hard cutoff. The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. 
If you are not sure which values to use or see something unexpected, please open an issue .","title":"ganon report"},{"location":"table/","text":"Table :) ganon table filters and summarizes several reports obtained with ganon report into a table. Filters for each sample or for averages among all samples can also be applied. Examples :) Given several .tre from ganon report : Counts of species :) ganon table --input *.tre --output-file table.tsv --rank species Abundance of species :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species Top 10 species (among all samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10 Top 10 species (from each samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10 Filtering results :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Table (ganon table)"},{"location":"table/#table","text":"ganon table filters and summarizes several reports obtained with ganon report into a table. 
Filters for each sample or for averages among all samples can also be applied.","title":"Table"},{"location":"table/#examples","text":"Given several .tre from ganon report :","title":"Examples"},{"location":"table/#counts-of-species","text":"ganon table --input *.tre --output-file table.tsv --rank species","title":"Counts of species"},{"location":"table/#abundance-of-species","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species","title":"Abundance of species"},{"location":"table/#top-10-species-among-all-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10","title":"Top 10 species (among all samples)"},{"location":"table/#top-10-species-from-each-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10","title":"Top 10 species (from each samples)"},{"location":"table/#filtering-results","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Filtering results"},{"location":"tutorials/","text":"Tutorials :) ... soon ...","title":"Tutorials"},{"location":"tutorials/#tutorials","text":"... soon ...","title":"Tutorials"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"ganon :) Code: GitHub repository ganon2 pre-print ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was mainly developed, but not limited, to the metagenomics classification problem: quickly assign sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported. Features :) integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all customizable database build for local or non-standard sequence files optimized taxonomic binning and profiling configurations build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization hierarchical classification using several databases in one or more levels in just one run EM and/or LCA algorithms to solve multiple-matching reads reporting of multiple and unique matches for every read reporting of sequence, taxonomic or multi-match abundances with optional genome size correction advanced tree-like reports with several filter options generation of contingency tables with several filters for multi-sample studies ganon achieved very good results in our own evaluations but also in independent evaluations: LEMMI , LEMMI v2 and CAMI2 Installation with conda :) The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are possible performance benefits compiling ganon from source in the target machine rather than using the conda version. 
To do so, please follow the instructions below: Installation from source :) Python dependencies :) python >=3.6 pandas >=1.2.0 multitax >=1.3.1 genome_updater >=0.6.3 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.2.0\" \"multitax>=1.3.1\" wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh && chmod +x genome_updater.sh # Conda/Mamba (alternative) conda install -c bioconda -c conda-forge \"pandas>=1.2.0\" \"multitax>=1.3.1\" \"genome_updater>=0.6.3\" C++ dependencies :) GCC >=11 CMake >=3.4 zlib bzip2 raptor ==3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have to set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" replacing /path/to/miniconda3 with your local path to the conda installation. Downloading and building ganon + submodules :) git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional to change install location (e.g. /myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ use -DINCLUDE_DIRS to set alternative paths to cxxopts and Catch2 libs. 
to classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON Installing raptor :) The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor=3.0.1\" (already included in ganon install via conda). Note raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ) To build old style ganon indices ganon build --filter-type ibf , raptor is not required To install raptor from source, follow the instructions below: Dependencies :) CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version) Downloading and building raptor + submodules :) git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. make -j 4 binaries will be located in the bin directory you may have to inform ganon build the path to the binaries with --raptor-path raptor/build/bin Testing :) If everything was properly installed, the following command should show the help pages without errors: ganon -h Running tests :) python3 -m pip install \"parameterized>=0.9.0\" # Alternative: conda install -c conda-forge \"parameterized>=0.9.0\" python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV . Parameters :) usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 
2.1.0 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. 
(default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. 
(default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Tab-separated file with all necessary file/sequence information. Fields: file [ target node specialization specialization name]. For details: https://pirovc.github.io/ganon/custom_databases/. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. Parse input by file or by sequence. Using 'file' is recommended and will speed-up the building process (default: file) -l , --level Max. level to build the database. By default, --level is the --input-target. 
Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. [refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. 
With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. 
{db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] 
Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). 
The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. 
In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. 
Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). 
Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] 
Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: [])","title":"ganon2"},{"location":"#ganon","text":"Code: GitHub repository ganon2 pre-print ganon is designed to index large sets of genomic reference sequences and to classify reads against them efficiently. The tool uses Hierarchical Interleaved Bloom Filters as indices based on k-mers with optional minimizers. It was developed mainly for, but is not limited to, the metagenomics classification problem: quickly assigning sequence fragments to their closest reference among thousands of references. After classification, taxonomic or sequence abundances are estimated and reported.","title":"ganon"},{"location":"#features","text":"integrated download and build of any subset from RefSeq/Genbank/GTDB with incremental updates NCBI and GTDB native support for taxonomic classification, custom taxonomy or no taxonomy at all customizable database build for local or non-standard sequence files optimized taxonomic binning and profiling configurations build and classify at various taxonomic levels, strain, assembly, file, sequence or custom specialization hierarchical classification using several databases in one or more levels in just one run EM and/or LCA algorithms to solve multiple-matching reads reporting of multiple and unique matches for every read reporting of sequence, taxonomic or multi-match abundances with optional genome size correction advanced tree-like reports with several filter options generation of contingency tables with several filters for multi-sample studies ganon achieved very good results in our own evaluations as well as in independent evaluations: LEMMI, LEMMI v2 and CAMI2","title":"Features"},{"location":"#installation-with-conda","text":"The easiest way to install ganon is via conda, using the bioconda and conda-forge channels: conda install -c bioconda -c conda-forge ganon However, there are 
possible performance benefits to compiling ganon from source on the target machine rather than using the conda version. To do so, please follow the instructions below:","title":"Installation with conda"},{"location":"#installation-from-source","text":"","title":"Installation from source"},{"location":"#python-dependencies","text":"python >=3.6 pandas >=1.2.0 multitax >=1.3.1 genome_updater >=0.6.3 # Python version should be >=3.6 python3 -V # Install packages via pip or conda: # PIP python3 -m pip install \"pandas>=1.2.0\" \"multitax>=1.3.1\" wget --quiet --show-progress https://raw.githubusercontent.com/pirovc/genome_updater/master/genome_updater.sh && chmod +x genome_updater.sh # Conda/Mamba (alternative) conda install -c bioconda -c conda-forge \"pandas>=1.2.0\" \"multitax>=1.3.1\" \"genome_updater>=0.6.3\"","title":"Python dependencies"},{"location":"#c-dependencies","text":"GCC >=11 CMake >=3.4 zlib bzip2 raptor ==3.0.1 Tip If your system has GCC version 10 or below, you can create an environment with the latest conda-forge GCC version and dependencies: conda create -c conda-forge -n gcc-conda gcc gxx zlib bzip2 cmake and activate the environment with: source activate gcc-conda . In CMake, you may have to set the environment include directory with the following parameter: -DSEQAN3_CXX_FLAGS=\"-I/path/to/miniconda3/envs/gcc-conda/include/\" replacing /path/to/miniconda3 with your local path to the conda installation.","title":"C++ dependencies"},{"location":"#downloading-and-building-ganon-submodules","text":"git clone --recurse-submodules https://github.com/pirovc/ganon.git # Install Python side cd ganon python3 setup.py install --record files.txt # optional # Compile and install C++ side mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DVERBOSE_CONFIG=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCONDA=OFF -DLONGREADS=OFF .. make -j 4 sudo make install # optional To change the install location (e.g. 
/myprefix/bin/ ), set the installation prefix in the cmake command with -DCMAKE_INSTALL_PREFIX=/myprefix/ . Use -DINCLUDE_DIRS to set alternative paths to the cxxopts and Catch2 libs. To classify extremely large reads or contigs that would need more than 65000 k-mers, use -DLONGREADS=ON","title":"Downloading and building ganon + submodules"},{"location":"#installing-raptor","text":"The easiest way to install raptor is via conda with conda install -c bioconda -c conda-forge \"raptor=3.0.1\" (already included when ganon is installed via conda). Note raptor is required to build databases with the Hierarchical Interleaved Bloom Filter ( ganon build --filter-type hibf ). To build old-style ganon indices ( ganon build --filter-type ibf ), raptor is not required. To install raptor from source, follow the instructions below:","title":"Installing raptor"},{"location":"#dependencies","text":"CMake >= 3.18 GCC 11, 12 or 13 (most recent minor version)","title":"Dependencies"},{"location":"#downloading-and-building-raptor-submodules","text":"git clone --branch raptor-v3.0.1 --recurse-submodules https://github.com/seqan/raptor cd raptor mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS=\"-std=c++23 -Wno-interference-size\" .. 
make -j 4 The binaries will be located in the bin directory. You may have to pass the path to the binaries to ganon build with --raptor-path raptor/build/bin","title":"Downloading and building raptor + submodules"},{"location":"#testing","text":"If everything was properly installed, the following command should show the help pages without errors: ganon -h","title":"Testing"},{"location":"#running-tests","text":"python3 -m pip install \"parameterized>=0.9.0\" # Alternative: conda install -c conda-forge \"parameterized>=0.9.0\" python3 -m unittest discover -s tests/ganon/integration/ python3 -m unittest discover -s tests/ganon/integration_online/ # optional - downloads large files cd build/ ctest -VV .","title":"Running tests"},{"location":"#parameters","text":"usage: ganon [-h] [-v] {build,build-custom,update,classify,reassign,report,table} ... - - - - - - - - - - _ _ _ _ _ (_|(_|| |(_)| | _| v. 2.1.0 - - - - - - - - - - positional arguments: {build,build-custom,update,classify,reassign,report,table} build Download and build ganon default databases (refseq/genbank) build-custom Build custom ganon databases update Update ganon default databases classify Classify reads against built databases reassign Reassign reads with multiple matches with an EM algorithm report Generate reports from classification results table Generate table from reports options: -h, --help show this help message and exit -v, --version Show program's version number and exit. ganon build usage: ganon build [-h] [-g [...]] [-a [...]] [-l] [-b [...]] [-o] [-c] [-r] [-u] [-m [...]] [-z [...]] [--skip-genome-size] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -g [ ...], --organism-group [ ...] One or more organism groups to download [archaea, bacteria, fungi, human, invertebrate, metagenomes, other, plant, protozoa, vertebrate_mammalian, vertebrate_other, viral]. 
Mutually exclusive --taxid (default: None) -a [ ...], --taxid [ ...] One or more taxonomic identifiers to download. e.g. 562 (-x ncbi) or 's__Escherichia coli' (-x gtdb). Mutually exclusive --organism-group (default: None) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) database arguments: -l , --level Highest level to build the database. Options: any available taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for a assembly/strain based analysis (default: species) download arguments: -b [ ...], --source [ ...] Source to download [refseq, genbank] (default: ['refseq']) -o , --top Download limited assemblies for each taxa. 0 for all. (default: 0) -c, --complete-genomes Download only sub-set of complete genomes (default: False) -r, --representative-genomes Download only sub-set of representative genomes (default: False) -u , --genome-updater Additional genome_updater parameters (https://github.com/pirovc/genome_updater) (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. 
With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon build-custom usage: ganon build-custom [-h] [-i [...]] [-e] [-c] [-n] [-a] [-l] [-m [...]] [-z [...]] [--skip-genome-size] [-r [...]] [-q [...]] -d DB_PREFIX [-x] [-t] [-p] [-k] [-w] [-s] [-f] [-j] [-y] [-v] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). Mutually exclusive --input-file. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). 
(default: fna.gz) -c, --input-recursive Look for files recursively in folder(s) provided with --input (default: False) -d DB_PREFIX, --db-prefix DB_PREFIX Database output prefix (default: None) custom arguments: -n , --input-file Tab-separated file with all necessary file/sequence information. Fields: file [ target node specialization specialization name]. For details: https://pirovc.github.io/ganon/custom_databases/. Mutually exclusive --input (default: None) -a , --input-target Target to use [file, sequence]. Parse input by file or by sequence. Using 'file' is recommended and will speed-up the building process (default: file) -l , --level Max. level to build the database. By default, --level is the --input-target. Options: any available taxonomic rank [species, genus, ...] or 'leaves' (requires --taxonomy). Further specialization options [assembly, custom]. assembly will retrieve and use the assembly accession and name. custom requires and uses the specialization field in the --input-file. (default: None) -m [ ...], --taxonomy-files [ ...] Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Activate this option when using sequences not representing full genomes. (default: False) ncbi arguments: -r [ ...], --ncbi-sequence-info [ ...] Uses NCBI e-utils webservices or downloads accession2taxid files to extract target information. [eutils, nucl_gb, nucl_wgs, nucl_est, nucl_gss, pdb, prot, dead_nucl, dead_wgs, dead_prot or one or more accession2taxid files from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/]. By default uses e-utils up-to 50000 sequences or downloads nucl_gb nucl_wgs otherwise. (default: []) -q [ ...], --ncbi-file-info [ ...] Downloads assembly_summary files to extract target information. 
[refseq, genbank, refseq_historical, genbank_historical or one or more assembly_summary files from https://ftp.ncbi.nlm.nih.gov/genomes/] (default: ['refseq', 'genbank']) important arguments: -x , --taxonomy Set taxonomy to enable taxonomic classification, lca and reports [ncbi, gtdb, skip] (default: ncbi) -t , --threads advanced arguments: -p , --max-fp Max. false positive for bloom filters. Mutually exclusive --filter-size. Defaults to 0.001 with --filter-type hibf or 0.05 with --filter-type ibf. (default: None) -k , --kmer-size The k-mer size to split sequences. (default: 19) -w , --window-size The window-size to build filter with minimizers. (default: 31) -s , --hash-functions The number of hash functions for the interleaved bloom filter [1-5]. With --filter-type ibf, 0 will try to set optimal value. (default: 4) -f , --filter-size Fixed size for filter in Megabytes (MB). Mutually exclusive --max-fp. Only valid for --filter- type ibf. (default: 0) -j , --mode Create smaller or faster filters at the cost of classification speed or database size, respectively [avg, smaller, smallest, faster, fastest]. If --filter-size is used, smaller/smallest refers to the false positive rate. By default, an average value is calculated to balance classification speed and database size. Only valid for --filter-type ibf. (default: avg) -y , --min-length Skip sequences smaller then value defined. 0 to not skip any sequence. Only valid for --filter- type ibf. (default: 0) -v , --filter-type Variant of bloom filter to use [hibf, ibf]. hibf requires raptor >= v3.0.1 installed or binary path set with --raptor-path. --mode, --filter-size and --min-length will be ignored with hibf. hibf will set --max-fp 0.001 as default. (default: hibf) optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. 
(default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon update usage: ganon update [-h] -d DB_PREFIX [-o] [-t] [--restart] [--verbose] [--quiet] [--write-info-file] options: -h, --help show this help message and exit required arguments: -d DB_PREFIX, --db-prefix DB_PREFIX Existing database input prefix (default: None) important arguments: -o , --output-db-prefix Output database prefix. By default will be the same as --db-prefix and overwrite files (default: None) -t , --threads optional arguments: --restart Restart build/update from scratch, do not try to resume from the latest possible step. {db_prefix}_files/ will be deleted if present. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) --write-info-file Save copy of target info generated to {db_prefix}.info.tsv. Can be re-used as --input-file for further attempts. (default: False) ganon classify usage: ganon classify [-h] -d [DB_PREFIX ...] [-s [reads.fq[.gz] ...]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] ...]] [-c [...]] [-e [...]] [-m] [--ranks [...]] [--min-count] [--report-type] [--skip-report] [-o] [--output-one] [--output-all] [--output-unclassified] [--output-single] [-t] [-b] [-f [...]] [-l [...]] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -d [DB_PREFIX ...], --db-prefix [DB_PREFIX ...] Database input prefix[es] (default: None) -s [reads.fq[.gz] ...], --single-reads [reads.fq[.gz] ...] Multi-fastq[.gz] file[s] to classify (default: None) -p [reads.1.fq[.gz] reads.2.fq[.gz] ...], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] ...] Multi-fastq[.gz] pairs of file[s] to classify (default: None) cutoff/filter arguments: -c [ ...], --rel-cutoff [ ...] Min. 
percentage of a read (set of k-mers) shared with a reference necessary to consider a match. Generally used to remove low similarity matches. Single value or one per database (e.g. 0.7 1 0.25). 0 for no cutoff (default: [0.75]) -e [ ...], --rel-filter [ ...] Additional relative percentage of matches (relative to the best match) to keep. Generally used to keep top matches above cutoff. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [0.1]) post-processing/report arguments: -m , --multiple-matches Method to solve reads with multiple matches [em, lca, skip]. em -> expectation maximization algorithm based on unique matches. lca -> lowest common ancestor based on taxonomy. The EM algorithm can be executed later with 'ganon reassign' using the .all file (--output-all). (default: em) --ranks [ ...] Ranks to report taxonomic abundances (.tre). empty will report default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) --min-count Minimum percentage/counts to report an taxa (.tre) [use values between 0-1 for percentage, >1 for counts] (default: 5e-05) --report-type Type of report (.tre) [abundance, reads, matches, dist, corr]. More info in 'ganon report'. (default: abundance) --skip-report Disable tree-like report (.tre) at the end of classification. Can be done later with 'ganon report'. (default: False) output arguments: -o , --output-prefix Output prefix for output (.rep) and tree-like report (.tre). 
Empty to output to STDOUT (only .rep) (default: None) --output-one Output a file with one match for each read (.one) either an unique match or a result from the EM or a LCA algorithm (--multiple-matches) (default: False) --output-all Output a file with all unique and multiple matches (.all) (default: False) --output-unclassified Output a file with unclassified read headers (.unc) (default: False) --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy (default: False) other arguments: -t , --threads Number of sub-processes/threads to use (default: 1) -b, --binning Optimized parameters for binning (--rel-cutoff 0.25 --rel-filter 0 --min-count 0 --report-type reads). Will report sequence abundances (.tre) instead of tax. abundance. (default: False) -f [ ...], --fpr-query [ ...] Max. false positive of a query to accept a match. Applied after --rel-cutoff and --rel-filter. Generally used to remove false positives matches querying a database build with large --max-fp. Single value or one per hierarchy (e.g. 0.1 0). 1 for no filter (default: [1e-05]) -l [ ...], --hierarchy-labels [ ...] Hierarchy definition of --db-prefix files to be classified. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). The default value reported without hierarchy is 'H1' (default: None) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon reassign usage: ganon reassign [-h] -i -o OUTPUT_PREFIX [-e] [-s] [--remove-all] [--skip-one] [--verbose] [--quiet] options: -h, --help show this help message and exit required arguments: -i , --input-prefix Input prefix to find files from ganon classify (.all and optionally .rep) (default: None) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for reassigned file (.one and optionally .rep). 
In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.out' (default: None) EM arguments: -e , --max-iter Max. number of iterations for the EM algorithm. If 0, will run until convergence (check --threshold) (default: 10) -s , --threshold Convergence threshold limit to stop the EM algorithm. (default: 0) other arguments: --remove-all Remove input file (.all) after processing. (default: False) --skip-one Do not write output file (.one) after processing. (default: False) --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) ganon report usage: ganon report [-h] -i [...] [-e INPUT_EXTENSION] -o OUTPUT_PREFIX [-d [...]] [-x] [-m [...]] [-z [...]] [--skip-genome-size] [-f] [-t] [-r [...]] [-s] [-a] [-y] [-p [...]] [-k [...]] [-c] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.rep' file(s) from ganon classify. (default: None) -e INPUT_EXTENSION, --input-extension INPUT_EXTENSION Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: rep) -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX Output prefix for report file 'output_prefix.tre'. In case of multiple files, the base input filename will be appended at the end of the output file 'output_prefix + FILENAME.tre' (default: None) db/tax arguments: -d [ ...], --db-prefix [ ...] Database prefix(es) used for classification. Only '.tax' file(s) are required. If not provided, new taxonomy will be downloaded. Mutually exclusive with --taxonomy. (default: []) -x , --taxonomy Taxonomy database to use [ncbi, gtdb, skip]. Mutually exclusive with --db-prefix. (default: ncbi) -m [ ...], --taxonomy-files [ ...] 
Specific files for taxonomy - otherwise files will be downloaded (default: None) -z [ ...], --genome-size-files [ ...] Specific files for genome size estimation - otherwise files will be downloaded (default: None) --skip-genome-size Do not attempt to get genome sizes. Valid only without --db-prefix. Activate this option when using sequences not representing full genomes. (default: False) output arguments: -f , --output-format Output format [text, tsv, csv, bioboxes]. text outputs a tabulated formatted text file for better visualization. bioboxes is the the CAMI challenge profiling format (only percentage/abundances are reported). (default: tsv) -t , --report-type Type of report [abundance, reads, matches, dist, corr]. 'abundance' -> tax. abundance (re- distribute read counts and correct by genome size), 'reads' -> sequence abundance, 'matches' -> report all unique and shared matches, 'dist' -> like reads with re-distribution of shared read counts only, 'corr' -> like abundance without re-distribution of shared read counts (default: abundance) -r [ ...], --ranks [ ...] Ranks to report ['', 'all', custom list]. 'all' for all possible ranks. empty for default ranks [superkingdom, phylum, class, order, family, genus, species, assembly]. (default: []) -s , --sort Sort report by [rank, lineage, count, unique]. Default: rank (with custom --ranks) or lineage (with --ranks all) (default: ) -a, --no-orphan Omit orphan nodes from the final report. Otherwise, orphan nodes (= nodes not found in the db/tax) are reported as 'na' with root as direct parent. (default: False) -y, --split-hierarchy Split output reports by hierarchy (from ganon classify --hierarchy-labels). If activated, the output files will be named as '{output_prefix}.{hierarchy}.tre' (default: False) -p [ ...], --skip-hierarchy [ ...] One or more hierarchies to skip in the report (from ganon classify --hierarchy-labels) (default: []) -k [ ...], --keep-hierarchy [ ...] 
One or more hierarchies to keep in the report (from ganon classify --hierarchy-labels) (default: []) -c , --top-percentile Top percentile filter, based on percentage/relative abundance. Applied only at default ranks [superkingdom, phylum, class, order, family, genus, species, assembly] (default: 0) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] One or more taxids to report (including children taxa) (default: []) ganon table usage: ganon table [-h] -i [...] [-e] -o OUTPUT_FILE [-l] [-f] [-t] [-a] [-m] [-r] [-n] [--header] [--unclassified-label] [--filtered-label] [--skip-zeros] [--transpose] [--verbose] [--quiet] [--min-count] [--max-count] [--names [...]] [--names-with [...]] [--taxids [...]] options: -h, --help show this help message and exit required arguments: -i [ ...], --input [ ...] Input file(s) and/or folder(s). '.tre' file(s) from ganon report. (default: None) -e , --input-extension Required if --input contains folder(s). Wildcards/Shell Expansions not supported (e.g. *). (default: tre) -o OUTPUT_FILE, --output-file OUTPUT_FILE Output filename for the table (default: None) output arguments: -l , --output-value Output value on the table [percentage, counts]. 
percentage values are reported between [0-1] (default: counts) -f , --output-format Output format [tsv, csv] (default: tsv) -t , --top-sample Top hits of each sample individually (default: 0) -a , --top-all Top hits of all samples (ranked by percentage) (default: 0) -m , --min-frequency Minimum number/percentage of files containing an taxa to keep the taxa [values between 0-1 for percentage, >1 specific number] (default: 0) -r , --rank Define specific rank to report. Empty will report all ranks. (default: None) -n, --no-root Do not report root node entry and lineage. Direct and shared matches to root will be accounted as unclassified (default: False) --header Header information [name, taxid, lineage] (default: name) --unclassified-label Add column with unclassified count/percentage with the chosen label. May be the same as --filtered-label (e.g. unassigned) (default: None) --filtered-label Add column with filtered count/percentage with the chosen label. May be the same as --unclassified-label (e.g. unassigned) (default: None) --skip-zeros Do not print lines with only zero count/percentage (default: False) --transpose Transpose output table (taxa as cols and files as rows) (default: False) optional arguments: --verbose Verbose output mode (default: False) --quiet Quiet output mode (default: False) filter arguments: --min-count Minimum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --max-count Maximum number/percentage of counts to keep an taxa [values between 0-1 for percentage, >1 specific number] (default: 0) --names [ ...] Show only entries matching exact names of the provided list (default: []) --names-with [ ...] Show entries containing full or partial names of the provided list (default: []) --taxids [ ...] 
One or more taxids to report (including children taxa) (default: [])","title":"Parameters"},{"location":"classification/","text":"Classification ganon classify will match single and/or paired-end sets of reads against one or more databases. By default, parameters are optimized for taxonomic profiling, meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep: plain report of the run, used to further generate tree-like reports results.tre: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all. More information about output files here. Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application. Profiling ganon classify is set up by default to perform taxonomic profiling. It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here) Binning To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning, which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more information here) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g.
23, 27) will improve sensitivity of your results at the cost of a larger database and memory usage. Reads with multiple matches There are three options to handle reads with multiple matches in ganon classify: --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca: uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip: will not resolve multi-matching reads. Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all. Reports can be generated independently with ganon report using the output file .rep. Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank. In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped. Classifying more reads By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0, respectively). More details here. Multiple and Hierarchical classification ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix. They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed.
To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff) while others are set for each hierarchical level (e.g. --rel-filter). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3. `--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz Parameter details reads (--single-reads, --paired-reads) ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode).
cutoff and filter (--rel-cutoff, --rel-filter) ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter. Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches. filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) that should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> fewer read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references (ref1..5), sorted by number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25, the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 20.5, rounded up to 21 (ceiling is applied).
Next, with --rel-filter 0.5, the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep only the top of the remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 53.5, rounded up to 54 (ceiling is applied). ref1 and ref2 are reported as matches. Tip The actual number of unique k-mers in a read is used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match. False positive of a query (--fpr-query) ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a much smaller false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter. Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp, ensuring that the false positive rate is under control. In this case however, sensitivity of results may decrease.
Note The false positive of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"Classification (ganon classify)"},{"location":"classification/#classification","text":"ganon classify will match single and/or paired-end sets of reads against one or more databases. By default, parameters are optimized for taxonomic profiling, meaning that fewer reads will be classified but with a higher sensitivity. For example: ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32 Output files: results.rep: plain report of the run, used to further generate tree-like reports results.tre: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report) By default, ganon classify only writes report files. To get files with the classification of each read, use --output-one and/or --output-all. More information about output files here. Note ganon performs taxonomic profiling and/or binning (one tax. assignment for each read) at a taxonomic, strain or sequence level. Some guidelines are listed below; please choose the parameters according to your application.","title":"Classification"},{"location":"classification/#profiling","text":"ganon classify is set up by default to perform taxonomic profiling.
It uses: strict thresholds: --rel-cutoff 0.75 and --rel-filter 0.1 --min-count 0.00005 (0.005%) to exclude very low-abundance taxa --report-type abundance to generate taxonomic abundances, correcting for genome sizes (more information here)","title":"Profiling"},{"location":"classification/#binning","text":"To achieve better results for taxonomic binning or sequence classification, ganon classify can be configured with --binning, which is the same as: less strict thresholds: --rel-cutoff 0.25 --rel-filter 0 --min-count 0 reports all taxa with at least one read assigned to it --report-type reads will report sequence abundances instead of taxonomic abundances (more information here) Tip Database parameters in ganon build can also influence your results. Lower --max-fp (e.g. 0.1, 0.001) and higher --kmer-size (e.g. 23, 27) will improve sensitivity of your results at the cost of a larger database and memory usage.","title":"Binning"},{"location":"classification/#reads-with-multiple-matches","text":"There are three options to handle reads with multiple matches in ganon classify: --multiple-matches em (default): uses an Expectation-Maximization algorithm, re-assigning reads with multiple matches to one most probable target (defined by --level in the build procedure). --multiple-matches lca: uses the Lowest Common Ancestor algorithm, re-assigning reads with multiple matches to higher common ancestors in the taxonomic tree. --multiple-matches skip: will not resolve multi-matching reads. Tip The Expectation-Maximization can be performed independently with ganon reassign using the output files .rep and .all. Reports can be generated independently with ganon report using the output file .rep. Note --multiple-matches lca paired with --report-type abundance or dist will distribute read counts with multiple matches to one most probable target (defined by --level in the build procedure), instead of a higher taxonomic rank.
In this case the distribution is simply based on the number of taxa with unique matches and it is not as precise as the EM algorithm, but it will run faster since the per-read re-assignment can be skipped.","title":"Reads with multiple matches"},{"location":"classification/#classifying-more-reads","text":"By default ganon will classify fewer reads in favour of sensitivity. To classify more reads, use less strict --rel-cutoff and --rel-filter values (e.g. 0.25 and 0, respectively). More details here.","title":"Classifying more reads"},{"location":"classification/#multiple-and-hierarchical-classification","text":"ganon classify can be performed against multiple databases at the same time. The databases can also be provided in a hierarchical order. Multiple database classification can be performed by providing several inputs for --db-prefix. They are required to be built with the same --kmer-size and --window-size values. Multiple databases are considered as one (as if built together) and redundancy in content (same reference in two or more databases) is allowed. To classify reads in a hierarchical order, --hierarchy-labels should be provided. When using multiple hierarchical levels, output files will be generated for each level (use --output-single to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. --rel-cutoff) while others are set for each hierarchical level (e.g. --rel-filter). Examples Classification against 3 databases (as if they were one) using the same cutoff: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.75 \\ --single-reads reads.fq.gz Classification against 3 databases (as if they were one) using a different cutoff for each: ganon classify --db-prefix db1 db2 db3 \\ --rel-cutoff 0.2 0.3 0.1 \\ --single-reads reads.fq.gz In this example, reads are going to be classified first against db1 and db2. Reads without a valid match will be further classified against db3.
`--hierarchy-labels` are strings and are going to be sorted to define the hierarchy order, disregarding input order: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --single-reads reads.fq.gz In this example, classification will be performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`) a different `--rel-filter` will be used: ganon classify --db-prefix db1 db2 db3 \\ --hierarchy-labels 1_first 1_first 2_second \\ --rel-cutoff 1 0.5 0.25 \\ --rel-filter 0.1 0.5 \\ --single-reads reads.fq.gz","title":"Multiple and Hierarchical classification"},{"location":"classification/#parameter-details","text":"","title":"Parameter details"},{"location":"classification/#reads-single-reads-paired-reads","text":"ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The max. read length accepted is 65535 (accounting for both reads in paired mode).","title":"reads (--single-reads, --paired-reads)"},{"location":"classification/#cutoff-and-filter-rel-cutoff-rel-filter","text":"ganon has two main parameters to control the number and strictness of matches between reads and references: --rel-cutoff and --rel-filter. Every read can be classified against none, one or more references. ganon will report only matches after cutoff and filter thresholds are applied, based on the number of shared k-mers between sequences (use --rel-cutoff 0 and --rel-filter 1 to deactivate them). The cutoff is the first to be applied. It sets the min. percentage of k-mers of a read to be shared with a reference to consider a match. Next the filter is applied to the remaining matches.
filter thresholds are relative to the best and worst scoring match after cutoff and control the percentage of additional matches (if any) should be reported, sorted from the best to worst. filter won't change the total number of matched reads but will change the amount of unique or multi-matched reads. cutoff can be interpreted as the lower bound to discard spurious matches and filter as the fine tuning to control what to keep. In summary: --rel-cutoff controls the strictness of the matching algorithm. lower values -> more read matches higher values -> less read matches --rel-filter controls how many matches each read will have, from best to worst lower values -> more unique matching reads higher values -> more multi-matching reads For example, using a hypothetical number of k-mer matches, a certain read with 82 k-mers has the following matches with the 5 references ( ref1..5 ), sorted number of shared k-mers: reference shared k-mers ref1 82 ref2 68 ref3 44 ref4 25 ref5 20 With --rel-cutoff 0.25 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 ref1 82 ref2 68 ref3 44 ref4 25 ~~ref5~~ ~~20~~ X since the --rel-cutoff threshold is 82 * 0.25 = 21 (ceiling is applied). Next, with --rel-filter 0.5 , the following matches will be discarded: reference shared k-mers --rel-cutoff 0.25 --rel-filter 0.5 ref1 82 ref2 68 ~~ref3~~ ~~44~~ X ~~ref4~~ ~~25~~ X ~~ref5~~ ~~20~~ X since 82 is the best match and 25 is the worst remaining match, the filter will keep the top the remaining matches, based on the shared k-mers threshold 82 - ((82-25)*0.5) = 54 (ceiling is applied). ref1 and ref2 are reported as matches Tip The actual number of unique k-mers in a read are used as an upper bound to calculate the thresholds. The same is applied when using --window-size and minimizers. Note A different --rel-cutoff can be set for every database in a multiple or hierarchical database classification. 
A different --rel-filter can be set for every level of a hierarchical database classification. Note Reads that remain with only one reference match (after cutoff and filter are applied) are considered a unique match.","title":"cutoff and filter (--rel-cutoff, --rel-filter)"},{"location":"classification/#false-positive-of-a-query-fpr-query","text":"ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive rate of a ganon index is controlled by --max-fp when building the database. However, this value is the expected false positive rate for each k-mer. In practice, a sequence (several k-mers) will have a much lower false positive rate. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The --fpr-query will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduced by the properties of the data structure. By default, --fpr-query 1e-5 is used and it is applied after the --rel-cutoff and --rel-filter . Values between 1e-3 and 1e-10 are recommended. This threshold becomes more important when building smaller databases with higher --max-fp , ensuring that the false positive rate is under control. In this case however, sensitivity of results may decrease. Note The false positive rate of a query was first proposed in: Solomon, Brad, and Carl Kingsford. \u201cFast Search of Thousands of Short-Read Sequencing Experiments.\u201d Nature Biotechnology 34, no. 3 (2016): 1\u20136. https://doi.org/10.1038/nbt.3442.","title":"False positive of a query (--fpr-query)"},{"location":"custom_databases/","text":"Custom databases :) Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files.
The usage of this procedure depends on the configuration of your files: Filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz : genomic fasta files in the NCBI standard, with assembly accession in the beginning of the filename. Provide the files with the --input parameter. ganon will try to retrieve all necessary information to build the database. Headers like >NC_006297.1 Bacteroides fragilis YCH46 ... : sequence headers are in the NCBI standard, with sequence accession in after > and with a space afterwards (or line break). Provide the files with the --input parameter and set --input-target sequence . ganon will try to retrieve all necessary information to build the database. For non-standard filenames and headers, follow this Warning --input-target sequence will be slower to build and will use more disk space, since files have be re-written separately for each sequence. More information about building by file or sequence can be found here . The --level is a important parameter that will define the (max.) classification level for the database ( more infos ): --level file or sequence -> default behavior (depending on --input-target ), use file/sequence as classification target --level assembly -> will retrieve assembly related to the file/sequence, use assembly as classification target --level leaves or species , genus ,... -> group input by taxonomy, use tax. nodes at the rank chosen as classification target More infos about other parameters here . Non-standard files/headers with --input-file :) Alternatively to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. file : relative or full path to the sequence file target : any unique text to name the file, to be used in the taxonomy node : taxonomic node (e.g. 
taxid) to link entry with taxonomy specialization : creates a specialized taxonomic level with a custom name, allowing files to be grouped specialization_name : a name for the specialization, to be used in the taxonomy Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Below you find example of --input-file . Note they are slightly different depending on the --input-target chosen. They need to be tab-separated to be properly parsed (tsv). Examples of --input-file using the default --input-target file :) List of files :) sequences.fasta others.fasta No taxonomic information is provided so --taxonomy skip should be set. The classification against the generated database will be performed at file level ( --level file ), since that is the only available information given. List of files with alternative names :) sequences.fasta sequences others.fasta others Just like above, but with a specific name to be used for each file. Files and taxonomy :) sequences.fasta sequences 562 others.fasta others 623 The classification max. level against this database will depend on the value set for --level : --level file -> use the file (named with target) with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Files, taxonomy and specialization :) sequences.fasta sequences 562 ID44444 Escherichia coli TW10119 others.fasta others 623 ID55555 Shigella flexneri 1a The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level file -> use the file (named with target) as a tax. node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Examples of --input-file using --input-target sequence :) To provide a tabular information for every sequence in your files, you need to use the target field (2nd col.) 
of the --input-file to input sequence headers. For example: Sequences and taxonomy :) sequences.fasta NZ_CP054001.1 562 sequences.fasta NZ_CP117955.1 623 others.fasta header1 666 others.fasta header2 666 The classification max. level against this database will depend on the value set for --level : --level sequence -> use the sequence header with node as parent --level assembly -> will attempt to retrieve the assembly related to the sequence with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Sequences, taxonomy and specialization :) sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119 sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a others.fasta header1 666 StrainA My Strain others.fasta header2 666 StrainA My Strain The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level sequence -> use the sequence header with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy Examples :) Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom : HumGut :) Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available Plasmid, Plastid and Mitochondrion from RefSeq :) Extra repositories from RefSeq release not included as default databases. Website . # Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence UniVec, UniVec_core :) \"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. 
UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link . MGnify genome catalogues (MAGs) :) \"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". Article / Website / FTP . 
Currently available genome catalogues (2024-02-09): chicken-gut cow-rumen honeybee-gut human-gut human-oral human-vaginal marine mouse-gut non-model-fish-gut pig-gut zebrafish-fecal List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/ Example on how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v101 --taxonomy gtdb --level leaves --threads 8 Note MGnify genomes catalogues will be build with GTDB taxonomy. Pathogen detection FDA-ARGOS :) A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . # Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files BLAST databases (nt env_nt nt_prok ...) :) BLAST databases. Website / FTP . 
Current available nucleotide databases (2024-02-09): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies . The example shows how to download , parse and build a ganon database from BLAST database files. It does so by splitting the database into taxonomic specific files, to speed-up the build process: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 
sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon with the fields: filepath file taxid blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\".fna\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads ${threads} --level leaves # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences and input_file rm -rf \"${db}\" \"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately. Files from genome_updater :) To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32 Parameter details :) False positive and size (--max-fp, --filter-size) :) ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches during classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so that any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive rate of the query (all k-mers of a read) will be much lower, but is directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size .
When using this option, please observe the theoretical false positive rate of the index reported at the end of the building process. minimizers (--window-size, --kmer-size) :) In ganon build , when --window-size > --kmer-size , minimizers are used. That means that for every window, a single k-mer is selected. This produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. If --window-size = --kmer-size , all k-mers are going to be used to build the database. Target file or sequence (--input-target) :) This is a parameter that defines how ganon will parse your input files: - --input-target file (default) will consider every file provided with --input a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored). - --input-target sequence will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input into a separate file. This will take longer and use more disk space. --input-target file is the default behavior and the most efficient way to build databases. --input-target sequence should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired. Build level (--level) :) The --level parameter defines the max. depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. In ganon build the default value is species . In ganon build-custom the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves or --level species (or genus , family , ...)
will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use specialization (4th col.) defined in the --input-file . Genome sizes (--genome-size-files) :) ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective child nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database . Retrieving info (--ncbi-sequence-info, --ncbi-file-info) :) Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow this step to be customized. When --input-target sequence is set, the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. When --input-target file is set, --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument.
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Custom databases (ganon build-custom)"},{"location":"custom_databases/#custom-databases","text":"Besides the automated download and build ( ganon build ) ganon provides a highly customizable build procedure ( ganon build-custom ) to create databases from local sequence files. The usage of this procedure depends on the configuration of your files: Filename like GCA_002211645.1_ASM221164v1_genomic.fna.gz : genomic fasta files in the NCBI standard, with the assembly accession at the beginning of the filename. Provide the files with the --input parameter. ganon will try to retrieve all necessary information to build the database. Headers like >NC_006297.1 Bacteroides fragilis YCH46 ... : sequence headers are in the NCBI standard, with the sequence accession right after > and followed by a space (or line break). Provide the files with the --input parameter and set --input-target sequence . ganon will try to retrieve all necessary information to build the database. For non-standard filenames and headers, use --input-file . Warning --input-target sequence will be slower to build and will use more disk space, since files have to be re-written separately for each sequence. More information about building by file or sequence can be found here . The --level is an important parameter that defines the (max.) classification level for the database ( more information ): --level file or sequence -> default behavior (depending on --input-target ), use file/sequence as classification target --level assembly -> will retrieve assembly related to the file/sequence, use assembly as classification target --level leaves or species , genus ,... -> group input by taxonomy, use tax.
nodes at the rank chosen as classification target More information about other parameters here .","title":"Custom databases"},{"location":"custom_databases/#non-standard-filesheaders-with-input-file","text":"As an alternative to the automatic input methods, it is possible to manually define the input with either standard or non-standard filenames, accessions and headers to build custom databases with --input-file . This file should contain the following fields (tab-separated): file [ target node specialization specialization_name]. file : relative or full path to the sequence file target : any unique text to name the file, to be used in the taxonomy node : taxonomic node (e.g. taxid) to link entry with taxonomy specialization : creates a specialized taxonomic level with a custom name, allowing files to be grouped specialization_name : a name for the specialization, to be used in the taxonomy Warning the target and specialization fields (2nd and 4th col) cannot be the same as the node (3rd col) Below are examples of --input-file . Note that they differ slightly depending on the --input-target chosen. They need to be tab-separated to be properly parsed (tsv).","title":"Non-standard files/headers with --input-file"},{"location":"custom_databases/#examples-of-input-file-using-the-default-input-target-file","text":"","title":"Examples of --input-file using the default --input-target file"},{"location":"custom_databases/#list-of-files","text":"sequences.fasta others.fasta No taxonomic information is provided so --taxonomy skip should be set.
The classification against the generated database will be performed at file level ( --level file ), since that is the only available information given.","title":"List of files"},{"location":"custom_databases/#list-of-files-with-alternative-names","text":"sequences.fasta sequences others.fasta others Just like above, but with a specific name to be used for each file.","title":"List of files with alternative names"},{"location":"custom_databases/#files-and-taxonomy","text":"sequences.fasta sequences 562 others.fasta others 623 The classification max. level against this database will depend on the value set for --level : --level file -> use the file (named with target) with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Files and taxonomy"},{"location":"custom_databases/#files-taxonomy-and-specialization","text":"sequences.fasta sequences 562 ID44444 Escherichia coli TW10119 others.fasta others 623 ID55555 Shigella flexneri 1a The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level file -> use the file (named with target) as a tax. node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Files, taxonomy and specialization"},{"location":"custom_databases/#examples-of-input-file-using-input-target-sequence","text":"To provide a tabular information for every sequence in your files, you need to use the target field (2nd col.) of the --input-file to input sequence headers. For example:","title":"Examples of --input-file using --input-target sequence"},{"location":"custom_databases/#sequences-and-taxonomy","text":"sequences.fasta NZ_CP054001.1 562 sequences.fasta NZ_CP117955.1 623 others.fasta header1 666 others.fasta header2 666 The classification max. 
level against this database will depend on the value set for --level : --level sequence -> use the sequence header with node as parent --level assembly -> will attempt to retrieve the assembly related to the sequence with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Sequences and taxonomy"},{"location":"custom_databases/#sequences-taxonomy-and-specialization","text":"sequences.fasta NZ_CP054001.1 562 ID44444 Escherichia coli TW10119 sequences.fasta NZ_CP117955.1 623 ID55555 Shigella flexneri 1a others.fasta header1 666 StrainA My Strain others.fasta header2 666 StrainA My Strain The classification max. level against this database will depend on the value set for --level : --level custom -> use the specialization (named with specialization_name) with node as parent --level sequence -> use the sequence header with node as parent --level leaves or species , genus ,... -> files are grouped by taxonomy","title":"Sequences, taxonomy and specialization"},{"location":"custom_databases/#examples","text":"Below you will find some examples from commonly used repositories for metagenomics analysis with ganon build-custom :","title":"Examples"},{"location":"custom_databases/#humgut","text":"Collection of >30000 genomes from healthy human metagenomes. Article / Website . 
# Download sequence files wget --quiet --show-progress \"http://arken.nmbu.no/~larssn/humgut/HumGut.tar.gz\" tar xf HumGut.tar.gz # Download taxonomy and metadata files wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/ncbi_names.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/HumGut.tsv\" # Generate --input-file from metadata tail -n+2 HumGut.tsv | awk -F\"\\t\" '{print \"fna/\"$21\"\\t\"$1\"\\t\"$2}' > HumGut_ganon_input_file.tsv # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files ncbi_nodes.dmp ncbi_names.dmp --db-prefix HumGut --level strain --threads 32 Similarly using GTDB taxonomy files: # Download taxonomy files wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_nodes.dmp\" wget \"https://arken.nmbu.no/~larssn/humgut/gtdb_names.dmp\" # Build ganon database ganon build-custom --input-file HumGut_ganon_input_file.tsv --taxonomy-files gtdb_nodes.dmp gtdb_names.dmp --db-prefix HumGut_gtdb --level strain --threads 32 Note There is no need to use ganon's gtdb integration here since GTDB files in NCBI format are available","title":"HumGut"},{"location":"custom_databases/#plasmid-plastid-and-mitochondrion-from-refseq","text":"Extra repositories from RefSeq release not included as default databases. Website . 
# Download sequence files wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/\" wget -A genomic.fna.gz -m -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/\" ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence","title":"Plasmid, Plastid and Mitochondrion from RefSeq"},{"location":"custom_databases/#univec-univec_core","text":"\"UniVec is a non-redundant database of sequences commonly attached to cDNA or genomic DNA during the cloning process.\" Website . Useful to screen for vector and linker/adapter contamination. UniVec_core is a sub-set of the UniVec selected to reduce the false positive hits from real biological sources. # UniVec wget -O \"UniVec.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec\" echo -e \"UniVec.fasta\\tUniVec\\t81077\" > UniVec_ganon_input_file.tsv ganon build-custom --input-file UniVec_ganon_input_file.tsv --db-prefix UniVec --level leaves --threads 8 --skip-genome-size # UniVec_Core wget -O \"UniVec_Core.fasta\" --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/UniVec_Core\" echo -e \"UniVec_Core.fasta\\tUniVec_Core\\t81077\" > UniVec_Core_ganon_input_file.tsv ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves --threads 8 --skip-genome-size Note All UniVec entries in the examples are categorized as Artificial Sequence (NCBI txid:81077). Some are completely artificial but others may be derived from real biological sources. More information in this link .","title":"UniVec, UniVec_core"},{"location":"custom_databases/#mgnify-genome-catalogues-mags","text":"\"Genome catalogues are biome-specific collections of metagenomic-assembled and isolate genomes\". 
Article / Website / FTP . Currently available genome catalogues (2024-02-09): chicken-gut cow-rumen honeybee-gut human-gut human-oral human-vaginal marine mouse-gut non-model-fish-gut pig-gut zebrafish-fecal List currently available entries curl --silent --list-only ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/ Example on how to download and build the human-oral catalog: # Download metadata wget \"https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-oral/v1.0.1/genomes-all_metadata.tsv\" # Download sequence files with 12 threads tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl --silent ${1}| gzip -d | sed -e \"1,/##FASTA/ d\" | gzip > ${0}.fna.gz' # Generate ganon input file tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\\t' | awk -F\"\\t\" '{tax=\"1\";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1\".fna.gz\\t\"$1\"\\t\"tax}' > ganon_input_file.tsv # Build ganon database ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v101 --taxonomy gtdb --level leaves --threads 8 Note MGnify genomes catalogues will be build with GTDB taxonomy.","title":"MGnify genome catalogues (MAGs)"},{"location":"custom_databases/#pathogen-detection-fda-argos","text":"A collection of >1400 \"microbes that include biothreat microorganisms, common clinical pathogens and closely related species\". Article / Website / BioProject . 
# Download sequence files wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt grep \"strain=FDAARGOS\" assembly_summary_refseq.txt > fdaargos_assembly_summary.txt genome_updater.sh -e fdaargos_assembly_summary.txt -f \"genomic.fna.gz\" -o download -m -t 12 # Build ganon database ganon build-custom --input download/ --input-recursive --db-prefix fdaargos --ncbi-file-info download/assembly_summary.txt --level assembly --threads 32 Note The example above uses genome_updater to download files.","title":"Pathogen detection FDA-ARGOS"},{"location":"custom_databases/#blast-databases-nt-env_nt-nt_prok","text":"BLAST databases. Website / FTP . Currently available nucleotide databases (2024-02-09): 16S_ribosomal_RNA 18S_fungal_sequences 28S_fungal_sequences Betacoronavirus env_nt human_genome ITS_eukaryote_sequences ITS_RefSeq_Fungi LSU_eukaryote_rRNA LSU_prokaryote_rRNA mito mouse_genome nt nt_euk nt_others nt_prok nt_viruses patnt pdbnt ref_euk_rep_genomes ref_prok_rep_genomes refseq_rna refseq_select_rna ref_viroids_rep_genomes ref_viruses_rep_genomes SSU_eukaryote_rRNA tsa_nt List currently available entries curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"nucl-metadata.json\" | sed 's/-nucl-metadata.json/, /g' | sort Warning Some BLAST databases are very big and may require extreme computational resources to build. You may need to use some reduction strategies . The example shows how to download , parse and build a ganon database from BLAST database files. 
It does so by splitting the database into taxid-specific files to speed up the build process: # Define BLAST db db=\"16S_ribosomal_RNA\" threads=8 # Download BLAST db - re-run this command many times until all finish (no more output) curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep \"^${db}\\..*tar.gz$\" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress \"https://ftp.ncbi.nlm.nih.gov/blast/db/{}\" # OPTIONAL Download and check MD5 wget -O - -nd --quiet --show-progress \"ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\\.*tar.gz.md5\" > \"${db}.md5\" find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads:-1} -I{} md5sum {} > \"${db}_downloaded.md5\" diff -sy <(sort -k 2,2 \"${db}.md5\") <(sort -k 2,2 \"${db}_downloaded.md5\") # Should print \"Files /dev/fd/xx and /dev/fd/xx are identical\" # Extract BLAST db files, if successful, remove .tar.gz find -name \"${db}.*tar.gz\" -type f -printf '%P\\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > \"${db}_extracted_files.txt\" # Create folder to write sequence files (split into 10 sub-folders) seq 0 9 | xargs -i mkdir -p \"${db}\"/{} # This command extracts sequences from the blastdb and writes them into taxid specific files # It also generates the --input-file for ganon with the fields: filepath file taxid blastdbcmd -entry all -db \"${db}\" -outfmt \"%a %T %s\" | \\ awk -v db=\"$(realpath ${db})\" '{file=db\"/\"substr($2,1,1)\"/\"$2\".fna\"; print \">\"$1\"\\n\"$3 >> file; print file\"\\t\"$2\".fna\\t\"$2}' | \\ sort | uniq > \"${db}_ganon_input_file.tsv\" # Build ganon database ganon build-custom --input-file \"${db}_ganon_input_file.tsv\" --db-prefix \"${db}\" --threads ${threads} --level leaves # Delete extracted files and auxiliary files cat \"${db}_extracted_files.txt\" | xargs rm rm \"${db}_extracted_files.txt\" \"${db}.md5\" \"${db}_downloaded.md5\" # Delete sequences and input_file rm -rf \"${db}\" 
\"${db}_ganon_input_file.tsv\" Note blastdbcmd is a command from the BLAST+ software suite (tested version 2.14.0) and should be installed separately.","title":"BLAST databases (nt env_nt nt_prok ...)"},{"location":"custom_databases/#files-from-genome_updater","text":"To create a ganon database from files previously downloaded with genome_updater : ganon build-custom --input output_folder_genome_updater/version/ --input-recursive --db-prefix mydb --ncbi-file-info output_folder_genome_updater/assembly_summary.txt --level assembly --threads 32","title":"Files from genome_updater"},{"location":"custom_databases/#parameter-details","text":"","title":"Parameter details"},{"location":"custom_databases/#false-positive-and-size-max-fp-filter-size","text":"ganon indices are based on bloom filters and can have false positive matches. This can be controlled with the --max-fp parameter. The lower the --max-fp , the lower the chance of false positive matches on classification, but the larger the database size will be. For example, with --max-fp 0.01 the database will be built so that any target (defined by --level ) will have a 1 in 100 chance of reporting a false k-mer match. The false positive rate of the query (all k-mers of a read) will be much lower, but is directly affected by this value. Alternatively, one can set a specific size for the final index with --filter-size . When using this option, please observe the theoretical false positive rate of the index reported at the end of the building process.","title":"False positive and size (--max-fp, --filter-size)"},{"location":"custom_databases/#minimizers-window-size-kmer-size","text":"In ganon build , when --window-size > --kmer-size minimizers are used. That means that for every window, a single k-mer will be selected. It produces smaller database files and requires substantially less memory overall. It may increase building times but will have a huge benefit for classification times. Sensitivity and precision can be reduced by small margins. 
If --window-size = --kmer-size , all k-mers are going to be used to build the database.","title":"minimizers (--window-size, --kmer-size)"},{"location":"custom_databases/#target-file-or-sequence-input-target","text":"This is a parameter that defines how ganon will parse your input files: - --input-target file (default) will consider every file provided with --input a single unit (e.g. multi-fasta files are considered one input, sequence headers ignored). - --input-target sequence will use every sequence as a unit. For this, ganon will first decompose every sequence in the input files provided with --input into a separate file. This will take longer and use more disk space. --input-target file is the default behavior and the most efficient way to build databases. --input-target sequence should only be used when the input sequences are not separated by file (e.g. a single big FASTA file) or when classification at sequence level is desired.","title":"Target file or sequence (--input-target)"},{"location":"custom_databases/#build-level-level","text":"The --level parameter defines the maximum depth of the database for classification. This parameter is relevant because the --max-fp is going to be guaranteed at the --level chosen. In ganon build the default value is species . In ganon build-custom the level will be the same as --input-target , meaning that classification will be done either at file or sequence level. Alternatively, --level assembly will link the file or sequence target information with assembly accessions retrieved from NCBI. --level leaves or --level species (or genus , family , ...) will link the targets with taxonomic information and prune the tree at the chosen level. --level custom will use specialization (4th col.) 
defined in the --input-file .","title":"Build level (--level)"},{"location":"custom_databases/#genome-sizes-genome-size-files","text":"Ganon will automatically download auxiliary files to define an approximate genome size for each entry in the taxonomic tree. For --taxonomy ncbi the species_genome_size.txt.gz is used. For --taxonomy gtdb the *_metadata.tar.gz files are used. Those files can be directly provided with the --genome-size-files argument. Genome sizes of parent nodes are calculated as the average of the respective child nodes. Other nodes without directly assigned genome sizes will use the closest parent with a pre-calculated genome size. The genome sizes are stored in the ganon database .","title":"Genome sizes (--genome-size-files)"},{"location":"custom_databases/#retrieving-info-ncbi-sequence-info-ncbi-file-info","text":"Further taxonomy and assembly linking information has to be collected to properly build the database. --ncbi-sequence-info and --ncbi-file-info allow customizations on this step. With --input-target sequence , the --ncbi-sequence-info argument allows the use of NCBI e-utils webservices ( eutils ) or downloads accession2taxid files to extract target information (options nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot ). By default, ganon uses eutils for up to 50000 input sequences, otherwise it downloads nucl_gb nucl_wgs from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. Previously downloaded files can be directly provided with this argument. With --input-target file , --ncbi-file-info uses assembly_summary.txt from https://ftp.ncbi.nlm.nih.gov/genomes/ to extract target information (options refseq genbank refseq_historical genbank_historical ). Previously downloaded files can be directly provided with this argument. 
If you are using outdated, removed or inactive assembly or sequence files and accessions from NCBI, make sure to include dead_nucl dead_wgs for --ncbi-sequence-info or refseq_historical genbank_historical for --ncbi-file-info . The eutils option does not work with outdated accessions.","title":"Retrieving info (--ncbi-sequence-info, --ncbi-file-info)"},{"location":"default_databases/","text":"Databases :) ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data. RefSeq and GenBank :) NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results. 
Commonly used sub-sets :) RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown. The commands provided will download up-to-date assemblies and will require slightly larger resources. 
GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * Size in GB. ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here . Specific organisms or taxonomic groups :) It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq. 
More filter options :) ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository . GTDB :) By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12 Update (ganon update) :) Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter. Reproducibility :) If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. 
To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12 Reducing database size :) Filter type (IBF and HIBF) :) The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF False positive rate :) A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. This can be filtered with the --fpr-query parameter in ganon classify . k-mer and window size :) Define how much unique information is stored in the database. More details . The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy. Top assemblies :) RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. 
This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and using the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus Split databases :) Ganon allows classification with multiple databases in one level or in a hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ). Fixed size and Mode (only for --filter-type ibf) :) A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the chance of false positives on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refer to the false positive rate and not to the database size (which is fixed). Example :) Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configurations may be quite different.","title":"Databases (ganon build)"},{"location":"default_databases/#databases","text":"ganon automates the download, update and build of databases based on NCBI RefSeq and GenBank genomes repositories with ganon build and update commands, for example: ganon build -g archaea bacteria -d arc_bac -c -t 30 This will 
download archaeal and bacterial complete genomes from RefSeq and build a database with 30 threads. Some days later, the database can be updated to include the newest genomes with: ganon update -d arc_bac -t 30 Additionally, custom databases can be built with customized files and identifiers with the ganon build-custom command. Info We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get the most out of your data.","title":"Databases"},{"location":"default_databases/#refseq-and-genbank","text":"NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.","title":"RefSeq and GenBank"},{"location":"default_databases/#commonly-used-sub-sets","text":"RefSeq (2023-03-14) # assemblies # species Size* ganon build All genomes 295219 52781 160 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs All genomes - 1 assembly/species 52781 52781 128 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_t1s Complete genomes 44121 19715 35 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg Complete genomes - 1 assembly/species 19715 19715 29 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_rs_cg_t1s Representative genomes 18073 18073 69 ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg GenBank (2023-03-14) # assemblies # species 
Size* ganon build All genomes 1595845 99505 - ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb All genomes - 1 assembly/species 99505 99505 300 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_t1s Complete genomes 92917 34815 42 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg Complete genomes - 1 assembly/species 34815 34815 34 ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --genome-updater \"-A 'species:1'\" --db-prefix abfv_gb_cg_t1s Info Data obtained on 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers have certainly grown. The commands provided will download up-to-date assemblies and will require slightly larger resources. GTDB R214 # assemblies # species Size* ganon build All genomes 402709 85205 260 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb All genomes - 1 assembly/species 85205 85205 213 ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s Info GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank. * Size in GB. ganon requires up to 2x the database size in memory to build it. The memory required to use it in classification is approx. the same as the database size. As a rule of thumb, the more the better, so choose the most comprehensive sub-set possible given your computational resources. It is possible to build databases that consume a fixed size/RAM usage. Beware that smaller filters will increase the false positive rates when classifying. Other approaches can reduce the size/RAM requirements with some trade-offs . 
Alternatively, you can build one database for each organism group separately and use them in ganon classify in any order or even stack them hierarchically . This way, combinations of multiple databases are possible, extending use cases. Further examples of commonly used databases can be found here .","title":"Commonly used sub-sets"},{"location":"default_databases/#specific-organisms-or-taxonomic-groups","text":"It is also possible to generate databases for specific organisms or taxonomic branches with -a/--taxid , for example: ganon build --source refseq --taxid 562 317 --threads 48 --db-prefix coli_syringae will download and build a database for all Escherichia coli (taxid:562) and Pseudomonas syringae (taxid:317) assemblies from RefSeq.","title":"Specific organisms or taxonomic groups"},{"location":"default_databases/#more-filter-options","text":"ganon uses genome_updater to manage downloads and further specific options and filters can be provided with the parameter -u/--genome-updater , for example: ganon build -g bacteria -t 48 -d bac_refseq --genome-updater \"-A 'genus:3' -E 20230101\" will download the top 3 bacterial assemblies for each genus with date before 2023-01-01. For more information about genome_updater parameters, please check the repository .","title":"More filter options"},{"location":"default_databases/#gtdb","text":"By default, ganon will use the NCBI Taxonomy to build the database. However, GTDB is fully supported and can be used with the parameter --taxonomy gtdb . Filtering by taxonomic entries also works with GTDB, for example: ganon build --db-prefix fuso_gtdb --taxid \"f__Fusobacteriaceae\" --source refseq genbank --taxonomy gtdb --threads 12","title":"GTDB"},{"location":"default_databases/#update-ganon-update","text":"Default ganon databases generated with ganon build can be updated with ganon update . This procedure will download new files and re-generate the ganon database with the updated entries. 
For example, a database generated with the following command: ganon build --db-prefix arc_cg_rs --source refseq --organism-group archaea --complete-genomes --threads 12 will contain all archaeal complete genomes from NCBI RefSeq at the time of running. Some days later, the database can be updated, fetching only new sequences added to the NCBI repository with the command: ganon update --db-prefix arc_cg_rs --threads 12 Tip To not overwrite the current database and create a new one with the updated files, use the --output-db-prefix parameter.","title":"Update (ganon update)"},{"location":"default_databases/#reproducibility","text":"If you use ganon with default databases and want to re-generate them later or keep track of the content for reproducibility purposes, you can save the assembly_summary.txt file located inside the {output_prefix}_files/ directory. To re-download the exact same snapshot of files used, one could use genome_updater , for example: genome_updater.sh -e assembly_summary.txt -f \"genomic.fna.gz\" -o recovered_files -m -t 12","title":"Reproducibility"},{"location":"default_databases/#reducing-database-size","text":"","title":"Reducing database size"},{"location":"default_databases/#filter-type-ibf-and-hibf","text":"The Hierarchical Interleaved Bloom Filter (HIBF) is an improvement over the default Interleaved Bloom Filter (IBF) and generates smaller databases with faster query times ( article ). However, the HIBF takes a little longer to build and has less flexibility regarding size and further options in ganon. You can choose which filter to use with the --filter-type parameter in ganon build and ganon build-custom . Due to differences between the default IBF used in ganon and the HIBF, it is recommended to lower the false positive rate when using the HIBF. The default value for high sensitivity is 0.1% ( --filter-type hibf --max-fp 0.001 ). 
Hint For large unbalanced reference sets, lots of reads to query -> HIBF (default) For quick database build and more flexibility -> IBF","title":"Filter type (IBF and HIBF)"},{"location":"default_databases/#false-positive-rate","text":"A higher --max-fp value will generate a smaller database but with a higher number of false positive matches on classification. More details . Values between 0.001 (0.1%) and 0.3 (30%) are generally used. Hint When using higher --max-fp values, more false positive results may be generated. This can be filtered with the --fpr-query parameter in ganon classify .","title":"False positive rate"},{"location":"default_databases/#k-mer-and-window-size","text":"Define how much unique information is stored in the database. More details . The smaller the --kmer-size , the less unique the k-mers will be, reducing database size but also sensitivity in classification. The bigger the --window-size , the less information needs to be stored, resulting in smaller databases but with decreased classification accuracy.","title":"k-mer and window size"},{"location":"default_databases/#top-assemblies","text":"RefSeq and GenBank are highly biased toward a few organisms. This means that some species are highly represented in number of assemblies compared to others. This can not only bias analysis but also brings redundancy to the database. Choosing a certain number of top assemblies can mitigate those issues. Database sizes can also be drastically reduced without this redundancy, but \"strain-level\" analyses are then not possible. We recommend using top assemblies for larger and comprehensive reference sets (like the ones listed above ) and using the full set of assemblies for specific clade analysis. Example ganon build --top 1 will select one assembly for each taxonomic leaf (NCBI taxonomy still has strain, sub-species, ...) 
ganon build --genome-updater \"-A 'species:1'\" will select one assembly for each species ganon build --genome-updater \"-A 'genus:3'\" will select three assemblies for each genus","title":"Top assemblies"},{"location":"default_databases/#split-databases","text":"Ganon allows classification with multiple databases in one level or in a hierarchy ( More details ). This means that databases can be built separately and used in any combination as desired. There are usually some benefits of doing so: Smaller databases when building by organism group, for example: one for bacteria, another for viruses, ... since average genome sizes are quite different. Easier to maintain and update. Extend use cases and avoid misclassification due to contaminated databases. Use databases as quality control, for example: remove reads matching one database of host or vectors (check out ganon report --skip-hierarchy ).","title":"Split databases"},{"location":"default_databases/#fixed-size-and-mode-only-for-filter-type-ibf","text":"A fixed size for the database filter can be defined with --filter-size when using --filter-type ibf . The smaller the filter size, the higher the chance of false positives on classification. When using a fixed filter size, ganon will report the max. and avg. false positive rate at the end of the build. More details . --mode offers 5 different categories to build a database controlling the trade-off between size and classification speed. 
avg : Balanced mode smaller or smallest : create smaller databases with slower classification speed fast or fastest : create bigger databases with faster classification speed Warning If --filter-size is used, smaller and smallest refers to the false positive and not to the database size (which is fixed).","title":"Fixed size and Mode (only for --filter-type ibf)"},{"location":"default_databases/#example","text":"Besides the benefits of using HIBF and specific sub-sets of big repositories shown on the default databases table , examples of other reduction strategies with IBF can be seen below: RefSeq archaeal complete genomes from 2023-05-05 Strategy Size (MB) Smaller Trade-off default 318 - - cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --db-prefix arc_rs_cg --filter-type ibf --mode smallest 301 5% Slower classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --mode smallest --db-prefix arc_rs_cg_smallest --filter-type ibf --filter-size 256 256 19% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --filter-size 256 --db-prefix arc_rs_cg_fs256 --filter-type ibf --window-size 35 249 21% Less sensitive classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --window-size 35 --db-prefix arc_rs_cg_ws35 --filter-type ibf --max-fp 0.2 190 40% Higher false positive on classification cmd ganon build --source refseq --organism-group archaea --threads 12 --complete-genomes --max-fp 0.2 --db-prefix arc_rs_cg_fp0.2 --filter-type ibf Note This is an illustrative example and the reduction proportions for different configuration may be quite different","title":"Example"},{"location":"outputfiles/","text":"Output files :) ganon build/build-custom/update :) Every run on ganon build , ganon build-custom or ganon update will generate the following database files: 
{prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be updated later. Otherwise it can be deleted. Warning Database files generated with version 1.2.0 or higher are not compatible with older versions. ganon classify :) {prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . Prints 2 extra lines at the end with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count) ganon report :) {prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. 
reads : sequence abundances , reports the proportion of sequences assigned to a taxa, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to their original target, including multiple and shared matches. Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. 
Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep . ganon table :) {output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"Output files"},{"location":"outputfiles/#output-files","text":"","title":"Output files"},{"location":"outputfiles/#ganon-buildbuild-customupdate","text":"Every run on ganon build , ganon build-custom or ganon update will generate the following database files: {prefix} .ibf/.hibf : main bloom filter index file, extension based on the --filter-type option. {prefix} .tax : taxonomy tree, only generated if --taxonomy is used (fields: target/node, parent, rank, name, genome size) . {prefix} _files/ : ( ganon build only) folder containing downloaded reference sequence and auxiliary files. Not necessary for classification. Keep this folder if the database will be update later. Otherwise it can be deleted. 
Warning Database files generated with version 1.2.0 or higher are not compatible with older versions.","title":"ganon build/build-custom/update"},{"location":"outputfiles/#ganon-classify","text":"{prefix} .tre : full report file (see below) {prefix} .rep : plain report of the run with only targets that received a match. Can be used to re-generate full reports (.tre) with ganon report . Prints 2 extra lines at the end with #total_classified and #total_unclassified . Fields 1: hierarchy label 2: target 3: # total matches 4: # unique reads 5: # lca reads 6: rank 7: name {prefix} .one : output with one match for each classified read after EM or LCA algorithm. Only generated with --output-one active. If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.one (fields: read identifier, target, (max) k-mer/minimizer count) {prefix} .all : output with all matches for each read. Only generated with --output-all active Warning: file can be very large . If multiple hierarchy levels are set, one file for each level will be created: {prefix}.{hierarchy}.all (fields: read identifier, target, k-mer/minimizer count)","title":"ganon classify"},{"location":"outputfiles/#ganon-report","text":"{prefix} .tre : tab-separated tree-like report with cumulative counts and taxonomic lineage. There are several possible --report-type . More information on the different types of reports can be found here : abundance : will attempt to estimate taxonomic abundances by re-distributing read counts from LCA matches and correcting sequence abundance by approximate genome sizes. reads : sequence abundances , reports the proportion of sequences assigned to a taxon, each read classified is counted once. dist : like reads with read count re-distribution. corr : like reads with correction by genome size. matches : every match is reported to its original target, including multiple and shared matches. 
Each line in this report is a taxonomic entry (including the root node), with the following fields: col field obs example 1 rank phylum 2 target taxonomic id. or specialization (assembly id.) 562 3 lineage 1|131567|2|1224|28211|766|942|768|769 4 name Chromobacterium rhizoryzae 5 # unique number of reads that matched exclusively to this target 5 6 # shared number of reads with non-unique matches directly assigned to this target. Represents the LCA matches ( --report-type reads ), re-assigned matches ( --report-type abundance/dist ) or shared matches ( --report-type matches ) 10 7 # children number of unique and shared assignments to all children nodes of this target 20 8 # cumulative the sum of the unique, shared and children assignments up-to this target 35 9 % cumulative percentage of assignments or estimated relative abundance for --report-type abundance 43.24 The first line of the report file will show the number of unclassified reads (not for --report-type matches ) The CAMI challenge bioboxes profiling format is supported using --output-format bioboxes . In this format, only values for the percentage/abundance (col. 9) are reported. The root node and unclassified entries are omitted. The sum of cumulative assignments for the unclassified and root lines is 100%. The final cumulative sum of reads/matches may be under 100% if any filter is successfully applied and/or hierarchical selection is selected (keep/skip/split). For all report type but matches , only taxa that received direct read matches, either unique or by LCA assignment, are considered. Some reads may have only shared matches and will not be reported directly but will be accounted for on some parent level. 
To visualize those matches, create a report with --report-type matches or use directly the file {prefix} .rep .","title":"ganon report"},{"location":"outputfiles/#ganon-table","text":"{output_file}: a tab-separated file with counts/percentages of taxa for multiple samples Examples of output files The main output file is the `{prefix}.tre` which will summarize the results: unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 superkingdom 2 1|2 Bacteria 0 0 97 97 97.97980 phylum 1239 1|2|1239 Firmicutes 0 0 57 57 57.57576 phylum 1224 1|2|1224 Proteobacteria 0 0 40 40 40.40404 class 91061 1|2|1239|91061 Bacilli 0 0 57 57 57.57576 class 28211 1|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 class 1236 1|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 1385 1|2|1239|91061|1385 Bacillales 0 0 57 57 57.57576 order 204458 1|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 order 72274 1|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 186822 1|2|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 family 76892 1|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 family 468 1|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 44249 1|2|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 genus 75 1|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 genus 469 1|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species 1406 1|2|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 species 366602 1|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 species 470 1|2|1224|1236|72274|468|469|470 Acinetobacter baumannii 12 0 0 12 12.12121 running `ganon classify` or `ganon report` with `--ranks all`, the output will show all ranks used for classification and presented sorted by lineage (also available with `ganon report --sort lineage`): unclassified unclassified 0 0 0 2 2.02020 root 1 1 root 0 0 97 97 97.97980 no rank 131567 1|131567 cellular organisms 0 0 97 97 97.97980 superkingdom 2 1|131567|2 Bacteria 0 0 97 97 97.97980 phylum 1224 1|131567|2|1224 Proteobacteria 0 0 40 40 40.40404 class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0 0 12 12 12.12121 order 72274 1|131567|2|1224|1236|72274 Pseudomonadales 0 0 12 12 12.12121 family 468 1|131567|2|1224|1236|72274|468 Moraxellaceae 0 0 12 12 12.12121 genus 469 1|131567|2|1224|1236|72274|468|469 Acinetobacter 0 0 12 12 12.12121 species group 909768 1|131567|2|1224|1236|72274|468|469|909768 Acinetobacter calcoaceticus/baumannii complex 0 0 12 12 12.12121 species 470 1|131567|2|1224|1236|72274|468|469|909768|470 Acinetobacter baumannii 12 0 0 12 12.12121 class 28211 1|131567|2|1224|28211 Alphaproteobacteria 0 0 28 28 28.28283 order 204458 1|131567|2|1224|28211|204458 Caulobacterales 0 0 28 28 28.28283 family 76892 1|131567|2|1224|28211|204458|76892 Caulobacteraceae 0 0 28 28 28.28283 genus 75 1|131567|2|1224|28211|204458|76892|75 Caulobacter 0 0 28 28 28.28283 species 366602 1|131567|2|1224|28211|204458|76892|75|366602 Caulobacter sp. 
K31 28 0 0 28 28.28283 no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57 57.57576 phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57 57.57576 class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57 57.57576 order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57 57.57576 family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57 57.57576 genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57 57.57576 species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 0 0 57 57.57576 with `--output-format bioboxes` @Version:0.10.0 @SampleID:example.rep H1 @Ranks:superkingdom|phylum|class|order|family|genus|species|assembly @Taxonomy:db.tax @@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE 2 superkingdom 2 Bacteria 100.00000 1224 phylum 2|1224 Bacteria|Proteobacteria 56.89782 201174 phylum 2|201174 Bacteria|Actinobacteria 21.84869 1239 phylum 2|1239 Bacteria|Firmicutes 9.75197 976 phylum 2|976 Bacteria|Bacteroidota 6.15297 1117 phylum 2|1117 Bacteria|Cyanobacteria 2.23146 203682 phylum 2|203682 Bacteria|Planctomycetota 1.23353 57723 phylum 2|57723 Bacteria|Acidobacteria 0.52549 200795 phylum 2|200795 Bacteria|Chloroflexi 0.31118","title":"ganon table"},{"location":"reports/","text":"Reports :) ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches. 
Examples :) Given the output .rep from ganon classify and the database used ( --db-prefix ): Taxonomic profile with abundance estimation (default) :) ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance Sequence profile :) ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads Matches profile :) ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches Filtering results :) ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant. Parameter details :) report type (--report-type) :) Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances, which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. The re-distribution applies to reads classified with an LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same as reads with read count re-distribution corr is the same as reads with correction by genome size matches will report the total number of matches classified, either unique or shared. 
This option will output the total number of matches instead of the total number of reads","title":"Reports (ganon report)"},{"location":"reports/#reports","text":"ganon report filters and generates several reports and summaries from the results obtained with ganon classify . It is possible to summarize the results in terms of taxonomic and sequence abundances as well as total number of matches.","title":"Reports"},{"location":"reports/#examples","text":"Given the output .rep from ganon classify and the database used ( --db-prefix ):","title":"Examples"},{"location":"reports/#taxonomic-profile-with-abundance-estimation-default","text":"ganon report --db-prefix mydb --input results.rep --output-prefix tax_profile --report-type abundance","title":"Taxonomic profile with abundance estimation (default)"},{"location":"reports/#sequence-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix seq_profile --report-type reads","title":"Sequence profile"},{"location":"reports/#matches-profile","text":"ganon report --db-prefix mydb --input results.rep --output-prefix matches --report-type matches","title":"Matches profile"},{"location":"reports/#filtering-results","text":"ganon report --db-prefix mydb --input results.rep --output-prefix filtered --min-count 0.0005 --top-percentile 0.8 This will keep only results with a min. abundance of 0.05% and only the top 80% most abundant.","title":"Filtering results"},{"location":"reports/#parameter-details","text":"","title":"Parameter details"},{"location":"reports/#report-type-report-type","text":"Several reports are available with --report-type : reads , abundance , dist , corr , matches : reads reports sequence abundances, which are the basic proportion of reads classified in the sample. abundance will convert sequence abundance into taxonomic abundances by re-distributing read counts among leaf nodes and correcting by genome size. 
The re-distribution applies to reads classified with an LCA assignment and it is proportional to the number of unique matches of leaf nodes available in the ganon database (relative to the LCA node). Genome size is estimated based on NCBI or GTDB auxiliary files . Genome size correction is applied by rank based on default ranks only (superkingdom phylum class order family genus species assembly). Read counts in intermediate ranks will be corrected based on the closest parent default rank and re-assigned to its original rank. dist is the same as reads with read count re-distribution corr is the same as reads with correction by genome size matches will report the total number of matches classified, either unique or shared. This option will output the total number of matches instead of the total number of reads","title":"report type (--report-type)"},{"location":"start/","text":"Quick Start Guide :) Install :) conda install -c bioconda -c conda-forge ganon Download and Build a database :) Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above. Classify and generate a tax. profile :) Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile Important parameters :) The most important parameters and trade-offs to be aware of when using ganon: ganon build :) --level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain-based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive rate of the bloom filters. 
The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, especially read binning, at a cost of larger databases. ganon classify :) --rel-cutoff : defines the min. percentage of k-mers shared with a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> fewer read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with the same top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm. ganon report :) --report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead of a hard cutoff. 
The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. If you are not sure which values to use or see something unexpected, please open an issue .","title":"Quick Start"},{"location":"start/#quick-start-guide","text":"","title":"Quick Start Guide"},{"location":"start/#install","text":"conda install -c bioconda -c conda-forge ganon","title":"Install"},{"location":"start/#download-and-build-a-database","text":"Bacteria - NCBI RefSeq - representative genomes ganon build --db-prefix bac_rs_rg --source refseq --organism-group bacteria --representative-genomes --threads 24 If you want to test ganon functionalities with a smaller database, use archaea instead of bacteria in the example above.","title":"Download and Build a database"},{"location":"start/#classify-and-generate-a-tax-profile","text":"Download test reads ganon classify --db-prefix bac_rs_rg --output-prefix classify_results --single-reads H01_1M_0.1.fq.gz --threads 24 classify_results.tre -> taxonomic profile","title":"Classify and generate a tax. profile"},{"location":"start/#important-parameters","text":"The most important parameters and trade-offs to be aware of when using ganon:","title":"Important parameters"},{"location":"start/#ganon-build","text":"--level : Highest level to build the database. Can be a taxonomic rank [species, genus, ...], 'leaves' for taxonomic leaves or 'assembly' for an assembly/strain-based analysis. The more specific the level, the bigger the database will be. --max-fp : controls the false positive rate of the bloom filters. The higher the --max-fp , the smaller the databases at a cost of sensitivity in classification. --window-size --kmer-size : the window value should always be the same or larger than the k-mer value. The larger the difference between them, the smaller the database will be. 
However, some sensitivity/precision loss in classification is expected with small k-mer and/or large window . Larger k-mer values (e.g. 31 ) will improve classification, especially read binning, at a cost of larger databases.","title":"ganon build"},{"location":"start/#ganon-classify","text":"--rel-cutoff : defines the min. percentage of k-mers shared with a reference to consider a match. Higher values will improve precision and decrease sensitivity. For taxonomic profiling, a higher value between 0.4 and 0.8 may provide better results. For read binning, lower values between 0.2 and 0.4 are recommended. lower values -> more read matches higher values -> fewer read matches --rel-filter : filter matches in relation to the best and worst after the cutoff is applied. 0 means only matches with the same top score (# of k-mers ) as the best match will be kept. lower values -> more unique matching reads higher values -> more multi-matching reads --multiple-matches : defines how ganon treats multiple-matching reads. Either by an EM-algorithm based on unique matches or a taxonomy-based LCA algorithm.","title":"ganon classify"},{"location":"start/#ganon-report","text":"--report-type : reports either taxonomic, sequence or matches abundances. Use corr or abundance for taxonomic profiling, reads or dist for sequence profiling and matches to report a summary of all matches. --min-count : cutoff to discard underrepresented taxa. Useful to remove the common long tail of spurious matches and false positives when performing classification. Values between 0.0001 (0.01%) and 0.001 (0.1%) improved sensitivity and precision in our evaluations. The higher the value, the more precise the outcome, with a sensitivity loss. Alternatively --top-percentile can be used to keep a relative amount of taxa instead of a hard cutoff. The numeric values above are averages from several experiments with different sample types and database contents. They may not work as expected for your data. 
If you are not sure which values to use or see something unexpected, please open an issue .","title":"ganon report"},{"location":"table/","text":"Table :) ganon table filters and summarizes several reports obtained with ganon report into a table. Filters for each sample or for averages among all samples can also be applied. Examples :) Given several .tre files from ganon report : Counts of species :) ganon table --input *.tre --output-file table.tsv --rank species Abundance of species :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species Top 10 species (among all samples) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10 Top 10 species (from each sample) :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10 Filtering results :) ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Table (ganon table)"},{"location":"table/#table","text":"ganon table filters and summarizes several reports obtained with ganon report into a table. 
Filters for each sample or for averages among all samples can also be applied.","title":"Table"},{"location":"table/#examples","text":"Given several .tre files from ganon report :","title":"Examples"},{"location":"table/#counts-of-species","text":"ganon table --input *.tre --output-file table.tsv --rank species","title":"Counts of species"},{"location":"table/#abundance-of-species","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species","title":"Abundance of species"},{"location":"table/#top-10-species-among-all-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-all 10","title":"Top 10 species (among all samples)"},{"location":"table/#top-10-species-from-each-samples","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --top-sample 10","title":"Top 10 species (from each sample)"},{"location":"table/#filtering-results","text":"ganon table --input *.tre --output-file table.tsv --output-value percentage --rank species --min-count 0.0005 This will keep only results with a min. abundance of 0.05% .","title":"Filtering results"},{"location":"tutorials/","text":"Tutorials :) ... soon ...","title":"Tutorials"},{"location":"tutorials/#tutorials","text":"... soon ...","title":"Tutorials"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index cb272fa6..63ea9dbf 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ