Skip to content

Commit

Permalink
Merge pull request #258 from pirovc/dev
Browse files Browse the repository at this point in the history
ganon version 1.8.0
  • Loading branch information
pirovc committed Aug 23, 2023
2 parents 670d003 + c50223e commit 6558c70
Show file tree
Hide file tree
Showing 14 changed files with 154 additions and 100 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
# =============================================================================

cmake_minimum_required( VERSION 3.10 FATAL_ERROR )
project( ganon VERSION 1.7.0 LANGUAGES CXX )
project( ganon VERSION 1.8.0 LANGUAGES CXX )

# -----------------------------------------------------------------------------
# build setup
Expand Down
2 changes: 1 addition & 1 deletion docs/classification.md
Original file line number Diff line number Diff line change
Expand Up @@ -147,7 +147,7 @@ Note that reads that remain with only one reference match (after `cutoff` and `f

ganon uses Bloom Filters, probabilistic data structures that may return false positive results. The base false positive of a ganon index is controlled by `--max-fp` when building the database. However, this value is the expected false positive for each k-mer. In practice, a sequence (several k-mers) will have a way smaller false positive. ganon calculates the false positive rate of a query as suggested by (Solomon and Kingsford, 2016). The `--fpr-query` will control the max. value accepted to consider a match between a sequence and a reference, avoiding false positives that may be introduce by the properties of the data structure.

By default, `--fpr-query 1e-5` is used and it is applied after the `--rel-cutoff` and `--rel-filter`. Values between `1e-3` and `1e-10` are recommended. This threshold becomes more important when building smaller databases with higher `--max-fp`, assuring that the false positive is under control. In this case however, you may notice a in sensitivity of your results.
By default, `--fpr-query 1e-5` is used and it is applied after the `--rel-cutoff` and `--rel-filter`. Values between `1e-3` and `1e-10` are recommended. This threshold becomes more important when building smaller databases with higher `--max-fp`, assuring that the false positive is under control. In this case however, sensitivity of results may decrease.

!!! Note
The false positive of a query was first propose in: Solomon, Brad, and Carl Kingsford. “Fast Search of Thousands of Short-Read Sequencing Experiments.” Nature Biotechnology 34, no. 3 (2016): 1–6. https://doi.org/10.1038/nbt.3442.
14 changes: 7 additions & 7 deletions docs/custom_databases.md
Original file line number Diff line number Diff line change
Expand Up @@ -159,7 +159,7 @@ tail -n+2 genomes-all_metadata.tsv | cut -f 1,20 | xargs -P 12 -n2 sh -c 'curl -
tail -n+2 genomes-all_metadata.tsv | cut -f 1,15 | tr ';' '\t' | awk -F"\t" '{tax="1";for(i=NF;i>1;i--){if(length($i)>3){tax=$i;break;}};print $1".fna.gz\t"$1"\t"tax}' > ganon_input_file.tsv

# Build ganon database
ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --threads 32
ganon build-custom --input-file ganon_input_file.tsv --db-prefix mgnify_human_oral_v1 --taxonomy gtdb --level leaves --hibf --threads 32
```

!!! note
Expand Down Expand Up @@ -202,15 +202,15 @@ db="16S_ribosomal_RNA"
threads=8

# Download BLAST db - re-run this command many times until all finish (no more output)
curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "^${db}\..*tar.gz$" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/{}"
curl --silent --list-only ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | grep "^${db}\..*tar.gz$" | xargs -P ${threads:-1} -I{} wget --continue -nd --quiet --show-progress "https://ftp.ncbi.nlm.nih.gov/blast/db/{}"

# OPTIONAL Download and check MD5
wget -O - -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/blast/db/${db}\.*tar.gz.md5" > "${db}.md5"
md5sum "${db}".*tar.gz > "${db}_downloaded.md5"
diff -s <(sort -k 2,2 "${db}.md5") <(sort -k 2,2 "${db}_downloaded.md5") # Should print "Files /dev/fd/xx and /dev/fd/xx are identical"
find -name "${db}.*tar.gz" -type f -printf '%P\n' | xargs -P ${threads:-1} -I{} md5sum {} > "${db}_downloaded.md5"
diff -sy <(sort -k 2,2 "${db}.md5") <(sort -k 2,2 "${db}_downloaded.md5") # Should print "Files /dev/fd/xx and /dev/fd/xx are identical"

# Extract BLAST db files, if successful, remove .tar.gz
ls "${db}"*.tar.gz | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > "${db}_extracted_files.txt"
find -name "${db}.*tar.gz" -type f -printf '%P\n' | xargs -P ${threads} -I{} sh -c 'gzip -dc {} | tar --overwrite -vxf - && rm {}' > "${db}_extracted_files.txt"

# Create folder to write sequence files (split into 10 sub-folders)
seq 0 9 | xargs -i mkdir -p "${db}"/{}
Expand All @@ -222,7 +222,7 @@ awk -v db="$(realpath ${db})" '{file=db"/"substr($2,1,1)"/"$2".fna"; print ">"$1
sort | uniq > "${db}_ganon_input_file.tsv"

# Build ganon database
ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --level species --threads 12
ganon build-custom --input-file "${db}_ganon_input_file.tsv" --db-prefix "${db}" --hibf --level species --threads 12

# Delete extracted files and auxiliary files
cat "${db}_extracted_files.txt" | xargs rm
Expand All @@ -232,7 +232,7 @@ rm -rf "${db}" "${db}_ganon_input_file.tsv"
```

!!! note
`blastdbcmd` is a command from BLAST software suite and should be installed separately
`blastdbcmd` is a command from [BLAST+ software suite](https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) (tested version 2.14.0) and should be installed separately.

### Files from genome_updater

Expand Down
65 changes: 30 additions & 35 deletions docs/default_databases.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,44 +18,47 @@ Additionally, [custom databases](../custom_databases) can be built with customiz
!!! info
We DO NOT provide pre-built indices for download. ganon can build databases very efficiently. This way, you will always have up-to-date reference sequences and get most out of your data.


!!! warning
Databases are one of the most important elements when analyzing metagenomics data and should be chosen carefully based on the study data

## RefSeq and GenBank

NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered in many ways. The choice of those filters can drastically change the outcome of results.
NCBI RefSeq and GenBank repositories are common resources to obtain reference sequences to analyze metagenomics data. They are mainly divided into domains/organism groups (e.g. archaea, bacteria, fungi, ...) but can be further filtered. The choice of those filters can drastically change the outcome of results.


### Commonly used sub-sets

| RefSeq | # assemblies | Size (GB) * | |
|---|---|---|---|
| Complete | 295219 | 160 - 500 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_rs`</details> |
| One assembly per species | 52779 | 40 - 98 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_rs_t1s`</details> |
| Complete genomes (higher quality) | 44121 | 19 - 64 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_rs_cg`</details> |
| One assembly per species of complete genomes | 19713 | 8 - 27 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_rs_cg_t1s`</details> |
| One representative assembly per species | 18073 | 21 - 35 | <details><summary>cmd</summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --representative-genomes --db-prefix abfv_rs_rg`</details> |
| RefSeq (2023-03-14) | # assemblies | # species | Size* | `ganon build` |
|---|---|---|---|---|
| All genomes | 295219 | 52781 | 160 | <details><summary></summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --db-prefix abfv_rs`</details> |
| All genomes - 1 assembly/species | 52781 | 52781 | 128 | <details><summary></summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --genome-updater "-A 'species:1'" --db-prefix abfv_rs_t1s`</details> |
| Complete genomes | 44121 | 19715 | 35 | <details><summary></summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --complete-genomes --db-prefix abfv_rs_cg`</details> |
| Complete genomes - 1 assembly/species | 19715 | 19715 | 29 | <details><summary></summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --complete-genomes --genome-updater "-A 'species:1'" --db-prefix abfv_rs_cg_t1s`</details> |
| Representative genomes | 18073 | 18073 | 69 | <details><summary></summary>`ganon build --source refseq --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --representative-genomes --db-prefix abfv_rs_rg`</details> |

| GenBank (2023-03-14) | # assemblies | # species | Size* | `ganon build` |
|---|---|---|---|---|
| All genomes | 1595845 | 99505 | - | <details><summary></summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --db-prefix abfv_gb`</details> |
| All genomes - 1 assembly/species | 99505 | 99505 | 300 | <details><summary></summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --hibf --level species --genome-updater "-A 'species:1'" --db-prefix abfv_gb_t1s`</details> |
| Complete genomes | 92917 | 34815 | 42 | <details><summary></summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg`</details> |
| Complete genomes - 1 assembly/species | 34815 | 34815 | 34 | <details><summary></summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_gb_cg_t1s`</details> |

!!! info
Data obtained in 2023-03-14 for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers certainly grew a bit. The commands provided will download up-to-date assemblies and will require slightly larger resources.

| GTDB R214 | # assemblies | # species | Size* | `ganon build` |
|---|---|---|---|---|
| All genomes | 402709 | 85205 | 260 | <details><summary></summary>`ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --hibf --level species --taxonomy gtdb --db-prefix ab_gtdb`</details> |
| All genomes - 1 assembly/species | 85205 | 85205 | 213 | <details><summary></summary>`ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --hibf --level species --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s`</details> |

| GenBank | # assemblies | Size (GB) * | |
|---|---|---|---|
| Complete | 1595845 | - | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --db-prefix abfv_gb`</details> |
| One assembly per species | 99505 | 91 - 420 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --genome-updater "-A 'species:1'" --db-prefix abfv_gb_t1s`</details> |
| Complete genomes (higher quality) | 92917 | 24 - 132 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes --db-prefix abfv_gb_cg`</details> |
| One assembly per species of complete genomes | 34497 | 10 - 34 | <details><summary>cmd</summary>`ganon build --source genbank --organism-group archaea bacteria fungi viral --threads 48 --complete-genomes "-A 'species:1'" --db-prefix abfv_gb_cg_t1s`</details> |

\* Size (GB) is the final size of the database and the approximate amount of RAM necessary to build it (calculated with default parameters). The two values represent databases built with and without the `--hibf` parameter, respectively. The trade-offs between those two modes are explained [here](#hibf).
!!! info
GTDB covers only bacteria and archaea groups and has assemblies from RefSeq and GenBank.

!!! warning
The `# assemblies` were obtained on 2023-03-14 accounting for archaea, bacteria, fungi and viral groups only. By the time you are reading this, those numbers certainly grew a bit.
\* in GB -> the size of the database and the approx. RAM needed to build and use it.

- As a rule of thumb, the more the better, so choose the most comprehensive sub-set as possible given your computational resources
- Databases can have a fixed size/RAM usage with the `--filter-size` parameter. Beware that smaller filters will increase the false positive rates when classifying. Other approaches [can reduce the size/RAM requirements with some trade-offs](#reducing-database-size).
- Further examples of commonly used database can be found [here](../custom_databases/#examples).
- Alternatively, you can build one database for each organism group separately and use them in `ganon classify` in any order or even stack them hierarchically. This way combination of multiple databases are possible, extending use cases.
- Databases can have a fixed size/RAM usage with the `--filter-size` parameter (without `--hibf`). Beware that smaller filters will increase the false positive rates when classifying. Other approaches [can reduce the size/RAM requirements with some trade-offs](#reducing-database-size).
- Alternatively, you can build one database for each organism group separately and use them in `ganon classify` in [any order or even stack them hierarchically](../classification/#multiple-and-hierarchical-classification). This way combination of multiple databases are possible, extending use cases.

!!! tip
assemblies usually represent strains and can be used to do "strain-level classification"
Further examples of commonly used database can be found [here](../custom_databases/#examples).

### Specific organisms or taxonomic groups

Expand All @@ -79,22 +82,14 @@ will download top 3 archaeal assemblies for each genus with date before 2023-01-

## GTDB

By default, ganon will use the NCBI Taxonomy to build the database. However, [GTDB](gtdb.ecogenomic.org/) is fully supported and can be used with the parameter `--taxonomy gtdb`.

| GTDB R214 | # assemblies | Size (GB) | |
|---|---|---|---|
| Complete | 402709 | | <details><summary>cmd</summary>`ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --db-prefix ab_gtdb`</details> |
| One assembly for each species | 85205 | | <details><summary>cmd</summary>`ganon build --source refseq genbank --organism-group archaea bacteria --threads 48 --taxonomy gtdb --top 1 --db-prefix ab_gtdb_t1s`</details> |
By default, ganon will use the NCBI Taxonomy to build the database. However, [GTDB](gtdb.ecogenomic.org/) is fully supported and can be used with the parameter `--taxonomy gtdb`.

Filtering by taxonomic entries also work with GTDB, for example:

```bash
ganon build --db-prefix fuso_gtdb --taxid "f__Fusobacteriaceae" --source refseq genbank --taxonomy gtdb --threads 12
```

!!! info
GTDB covers only bacteria and archaea groups and has assemblies from both RefSeq and GenBank.

## Update (ganon update)

Default ganon databases generated with the `ganon build` can be updated with `ganon update`. This procedure will download new files and re-generate the ganon database with the updated entries.
Expand Down
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ git clone --branch raptor-v3.0.0 --recurse-submodules https://github.com/seqan/r
cd raptor
mkdir -p build
cd build
cmake ..
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-std=c++23 -Wno-interference-size" ..
make -j 4
```

Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ def read(filename):

setup(
name="ganon",
version="1.7.0",
version="1.8.0",
url="https://www.github.com/pirovc/ganon",
license='MIT',
author="Vitor C. Piro",
Expand Down
2 changes: 1 addition & 1 deletion src/ganon-build/GanonBuild.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ robin_hood::unordered_map< std::string, TFile > parse_input_file( const std::str

if ( fields.size() == 1 )
{
// target is the file itself (filename only wihtout path)
// target is the file itself (filename only without path)
const auto target = std::filesystem::path( file ).filename();
input_map[target][file] = {};
hashes_count[target] = 0;
Expand Down
29 changes: 19 additions & 10 deletions src/ganon-classify/GanonClassify.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -654,6 +654,16 @@ void write_report( TRep& rep, TTax& tax, std::ofstream& out_rep, std::string hie
}
}

static inline void replace_all( std::string& str, const std::string& from, const std::string& to )
{
size_t start_pos = 0;
while ( ( start_pos = str.find( from, start_pos ) ) != std::string::npos )
{
str.replace( start_pos, from.length(), to );
start_pos += to.length(); // Handles case where 'to' is a substring of 'from'
}
}

size_t load_filter( THIBF& filter,
IBFConfig& ibf_config,
TBinMap& bin_map,
Expand Down Expand Up @@ -693,18 +703,17 @@ size_t load_filter( THIBF& filter,
{
for ( auto const& filename : file_list )
{
// based on the filename, try to get only assembly accession (e.g.
// GCF_013391805.1_ASM1339180v1_genomic.fna.gz), otherwise use filename as target
// based on the filename, get target.minimiser
// (e.g. 562.minimiser or GCF_013391805.1.minimiser), otherwise use filename as target
auto f = std::filesystem::path( filename ).filename().string();
size_t found = f.find( '_' );
size_t found = f.find( ".minimiser" );
if ( found != std::string::npos )
{
found = f.find( '_', found + 1 );
if ( found != std::string::npos )
{
f = f.substr( 0, found );
}
}
f = f.substr( 0, found );

// workaround when file has a . (e.g. GCF_013391805.1)
// "." replaced by "|||" in ganon build wrapper
replace_all( f, "|||", "." );
replace_all( f, "---", " " );

bin_map.push_back( std::make_tuple( binno, f ) );
// same fpr for all
Expand Down
Loading

0 comments on commit 6558c70

Please sign in to comment.