Merge pull request #250 from pirovc/dev
ganon v1.6.0
pirovc committed May 10, 2023
2 parents 1d6be3b + 9435c4f commit 4472217
Showing 30 changed files with 1,809 additions and 1,206 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -43,3 +43,6 @@
# python setuptools
*.egg-info/
dist/

# mkdocs
site/
8 changes: 7 additions & 1 deletion CMakeLists.txt
@@ -3,7 +3,7 @@
# =============================================================================

cmake_minimum_required( VERSION 3.10 FATAL_ERROR )
-project( ganon VERSION 1.5.1 LANGUAGES CXX )
+project( ganon VERSION 1.6.0 LANGUAGES CXX )

# -----------------------------------------------------------------------------
# build setup
@@ -23,6 +23,7 @@ endif()

option( VERBOSE_CONFIG "Verbose mode for quick build setup debugging" OFF )
option( CONDA "Flag for compilation in conda env." OFF )
option( LONGREADS "Uses uint32_t for count in ganon-classify. Useful for very long reads (>65535bp)" OFF )
option( INCLUDE_DIRS "Include directories to look for libraries" "" )

# -----------------------------------------------------------------------------
@@ -81,6 +82,10 @@ if ( NOT CONDA )
add_compile_options( -static -march=native )
endif()

if( LONGREADS )
add_compile_options(-DLONGREADS)
endif()

# -----------------------------------------------------------------------------
# dependencies and 3rd party libraries
# -----------------------------------------------------------------------------
@@ -149,6 +154,7 @@ if( VERBOSE_CONFIG )
message( STATUS " INCLUDE_DIRS : ${INCLUDE_DIRS}" )
message( STATUS " CMAKE_INSTALL_PREFIX: ${CMAKE_INSTALL_PREFIX}" )
message( STATUS " CONDA : ${CONDA}" )
message( STATUS " LONGREADS : ${LONGREADS}" )
get_directory_property( dirCompileOptions COMPILE_OPTIONS )
message( STATUS " COMPILE_OPTIONS : ${dirCompileOptions}" )

977 changes: 18 additions & 959 deletions README.md

Large diffs are not rendered by default.

131 changes: 131 additions & 0 deletions docs/classification.md
@@ -0,0 +1,131 @@
# Classification

`ganon classify` will match single and/or paired-end reads against one or [more databases](#multiple-and-hierarchical-classification), for example:

```bash
ganon classify --db-prefix my_db --paired-reads reads.1.fq.gz reads.2.fq.gz --output-prefix results --threads 32
```

`ganon report` will be automatically executed after classification and a report will be created (`.tre`).

ganon can generate both taxonomic profiling and binning results with `ganon classify` + `ganon report`. Please choose the parameters according to your application.

### Profiling

`ganon classify` is set up by default to perform taxonomic profiling. It uses:

- strict `--rel-cutoff` and `--rel-filter` values (`0.75` and `0`, respectively)
- `--min-count 0.0001` (0.01%) on `ganon report` to exclude low-abundance groups
- `--report-type abundance` on `ganon report` to generate taxonomic abundances, re-distributing read counts and correcting for genome sizes

### Binning

To achieve better results for binning reads to specific references, ganon can be configured with:

- `--output-all` and `--output-lca` to write `.all` and `.lca` files with binning results
- less strict `--rel-cutoff` and `--rel-filter` values (e.g. `0.25` and `0.1`, respectively)
- `--reassign` on `ganon classify` (or the `ganon reassign` procedure) to apply an EM algorithm that re-assigns reads with LCA matches to their most probable target (at the `--level` the database was built with)

!!! tip
    Higher `--kmer-size` values on `ganon build` can also improve read binning sensitivity.

## Multiple and Hierarchical classification

`ganon classify` can be run against multiple databases at the same time. The databases can also be provided in a hierarchical order.

Multiple database classification is performed by providing several inputs to `--db-prefix`. All databases must be built with the same `--kmer-size` and `--window-size` values. Multiple databases are treated as one (as if built together), and redundancy in content (the same reference in two or more databases) is allowed.

To classify reads in a hierarchical order, provide `--hierarchy-labels`. When using multiple hierarchical levels, output files are generated for each level (use `--output-single` to generate a single output from multiple hierarchical levels). Please note that some parameters are set for each database (e.g. `--rel-cutoff`) while others are set for each hierarchical level (e.g. `--rel-filter`).

<details>
<summary>Examples</summary>
<br>
Classification against 3 databases (as if they were one) using the same cutoff:

```bash
ganon classify --db-prefix db1 db2 db3 \
--rel-cutoff 0.75 \
--single-reads reads.fq.gz
```

Classification against 3 databases (as if they were one) using a different cutoff for each:

```bash
ganon classify --db-prefix db1 db2 db3 \
--rel-cutoff 0.2 0.3 0.1 \
--single-reads reads.fq.gz
```

In this example, reads are first classified against db1 and db2. Reads without a valid match are then classified against db3. `--hierarchy-labels` are strings and are sorted to define the hierarchy order, regardless of input order:

```bash
ganon classify --db-prefix db1 db2 db3 \
--hierarchy-labels 1_first 1_first 2_second \
--single-reads reads.fq.gz
```
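The grouping described above can be sketched in a few lines of Python. This is an illustration only, not ganon's implementation: the function name `group_by_hierarchy` is hypothetical, and plain lexicographic sorting of the label strings is an assumption.

```python
from collections import defaultdict


def group_by_hierarchy(db_prefixes, hierarchy_labels):
    """Group databases by label and return the levels in sorted label order.

    Hypothetical helper: assumes labels are compared as plain strings.
    """
    if len(db_prefixes) != len(hierarchy_labels):
        raise ValueError("one hierarchy label is needed per database")
    levels = defaultdict(list)
    for db, label in zip(db_prefixes, hierarchy_labels):
        levels[label].append(db)
    # Sorting the labels defines the order, so input order does not matter.
    return [(label, levels[label]) for label in sorted(levels)]


# Databases given in a shuffled order still end up in the same hierarchy.
order = group_by_hierarchy(
    ["db3", "db1", "db2"], ["2_second", "1_first", "1_first"]
)
for label, dbs in order:
    print(label, dbs)
```

Under that sorting assumption, a label such as `10_last` would sort before `2_second`, so zero-padded prefixes (`01_`, `02_`, ...) may be a safer choice when many levels are used.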

In this example, classification is performed with a different `--rel-cutoff` for each database. For each hierarchy level (`1_first` and `2_second`), a different `--rel-filter` is used:

```bash
ganon classify --db-prefix db1 db2 db3 \
--hierarchy-labels 1_first 1_first 2_second \
--rel-cutoff 1 0.5 0.25 \
--rel-filter 0.1 0.5 \
--single-reads reads.fq.gz
```
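The pairing of values in the command above can be pictured with the following Python sketch. It is not ganon's code: `assign_thresholds` is a hypothetical helper, and matching `--rel-filter` values to levels in sorted label order is an assumption. Broadcasting a single value to all databases mirrors the first example, where one `--rel-cutoff` is given for three databases.

```python
def assign_thresholds(db_prefixes, hierarchy_labels, rel_cutoffs, rel_filters):
    """Pair each database with a cutoff and each hierarchy level with a filter.

    Hypothetical helper: assumes filters follow the sorted order of the labels.
    """
    levels = sorted(set(hierarchy_labels))
    # A single value is applied to every database / level.
    if len(rel_cutoffs) == 1:
        rel_cutoffs = rel_cutoffs * len(db_prefixes)
    if len(rel_filters) == 1:
        rel_filters = rel_filters * len(levels)
    if len(rel_cutoffs) != len(db_prefixes):
        raise ValueError("need one --rel-cutoff per database (or a single value)")
    if len(rel_filters) != len(levels):
        raise ValueError("need one --rel-filter per hierarchy level (or a single value)")
    cutoff_per_db = dict(zip(db_prefixes, rel_cutoffs))
    filter_per_level = dict(zip(levels, rel_filters))
    return cutoff_per_db, filter_per_level


cutoffs, filters = assign_thresholds(
    ["db1", "db2", "db3"],
    ["1_first", "1_first", "2_second"],
    [1, 0.5, 0.25],
    [0.1, 0.5],
)
print(cutoffs)
print(filters)
```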

</details>
<br>

## Parameter details

### reads (--single-reads, --paired-reads)

ganon accepts single-end and paired-end reads. Both types can be used at the same time. In paired-end mode, reads are always reported with the header of the first pair. Paired-end reads are classified in a standard forward-reverse orientation. The maximum read length accepted is 65535 bp (counting both reads together in paired mode).
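The 65535 limit equals the largest value of a 16-bit unsigned integer, and the `LONGREADS` build option added to `CMakeLists.txt` in this commit is described as switching the count to `uint32_t` for reads longer than 65535 bp. The Python sketch below only illustrates that arithmetic; `fits_counter` is a hypothetical name, and modelling the limit as a simple sum of both mate lengths is an assumption based on the sentence above.

```python
UINT16_MAX = 2**16 - 1  # 65535, largest value of a 16-bit unsigned counter
UINT32_MAX = 2**32 - 1  # largest value of a 32-bit unsigned counter


def fits_counter(read_length, mate_length=0, longreads=False):
    """Check whether a read (or read pair) stays within the counter limit.

    Hypothetical helper: `longreads=True` stands in for a build with the
    LONGREADS option enabled.
    """
    limit = UINT32_MAX if longreads else UINT16_MAX
    return read_length + mate_length <= limit


print(fits_counter(150, 150))                      # typical short read pair
print(fits_counter(40000, 30000))                  # pair exceeds the 16-bit limit
print(fits_counter(40000, 30000, longreads=True))  # fits with a 32-bit counter
```

Following the usual CMake convention for `option()`, the flag would be switched on at configure time with `-DLONGREADS=ON`.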

### cutoff and filter (--rel-cutoff, --rel-filter)

ganon has two parameters to control a match between reads and references: `--rel-cutoff` and `--rel-filter`.

Every read can be classified against none, one or more references. What will be reported is the remaining matches after `cutoff` and `filter` thresholds are applied, based on the number of shared minimizers (or k-mers) between sequences.

The `cutoff` is applied first. It should be set as the minimal value to consider a match between a read and a reference. Next, the `filter` is applied to the remaining matches. `filter` thresholds are relative to the best scoring match and control how far from the best match further matches are allowed. `cutoff` can be interpreted as the lower bound to discard spurious matches and `filter` as the fine tuning to control what to keep.

For example, using `--kmer-size 19` (and `--window-size 19` to simplify the example), a certain read (100bp) has the following matches against 5 references (`ref1..5`), sorted by shared k-mers:

| reference | shared k-mers |
|-----------|---------------|
| ref1 | 82 |
| ref2 | 68 |
| ref3 | 44 |
| ref4 | 25 |
| ref5 | 20 |

This read can have at most 82 shared k-mers (`100-19+1=82`). With `--rel-cutoff 0.25`, the following matches will be discarded:

| reference | shared k-mers | --rel-cutoff 0.25 |
|-----------|---------------|-------------------|
| ref1 | 82 | |
| ref2 | 68 | |
| ref3 | 44 | |
| ref4 | 25 | |
| ~~ref5~~ | ~~20~~ | X |

`ref5` is discarded because the `--rel-cutoff` threshold is `ceil(82 * 0.25) = 21` shared k-mers. Further, with `--rel-filter 0.3`, the following matches will be discarded:

| reference | shared k-mers | --rel-cutoff 0.25 | --rel-filter 0.3 |
|-----------|---------------|-------------------|------------------|
| ref1 | 82 | | |
| ref2 | 68 | | |
| ~~ref3~~ | ~~44~~ | | X |
| ~~ref4~~ | ~~25~~ | | X |
| ~~ref5~~ | ~~20~~ | X | |


Since the best match has 82 shared k-mers, the filter keeps only matches within `ceil(0.3 * 82) = 25` shared k-mers of the best, removing any match below `82 - 25 = 57` shared k-mers. `ref1` and `ref2` are reported as matches.
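The worked example above can be reproduced with a short Python sketch. It is not ganon's implementation: `apply_cutoff_and_filter` is a hypothetical name, the filter threshold is assumed to be the best count minus `ceil(best * rel_filter)` (chosen because it yields the 57 used in the example), and treating a count equal to a threshold as a pass is also an assumption.

```python
import math


def apply_cutoff_and_filter(matches, max_kmers, rel_cutoff, rel_filter):
    """Return the matches that survive the cutoff and then the filter.

    Hypothetical helper; thresholds use a ceiling, as in the text above.
    """
    # Step 1: cutoff, relative to the maximum possible shared k-mers.
    cutoff_threshold = math.ceil(max_kmers * rel_cutoff)
    passed = {ref: n for ref, n in matches.items() if n >= cutoff_threshold}
    if not passed:
        return {}
    # Step 2: filter, relative to the best remaining match (assumed formula).
    best = max(passed.values())
    filter_threshold = best - math.ceil(best * rel_filter)
    return {ref: n for ref, n in passed.items() if n >= filter_threshold}


matches = {"ref1": 82, "ref2": 68, "ref3": 44, "ref4": 25, "ref5": 20}
max_kmers = 100 - 19 + 1  # 100bp read with --kmer-size 19

kept = apply_cutoff_and_filter(matches, max_kmers, rel_cutoff=0.25, rel_filter=0.3)
print(kept)
```

With the profiling defaults mentioned earlier (`--rel-cutoff 0.75`, `--rel-filter 0`), the same sketch keeps only the best match, since a filter of `0` leaves no distance to the top score.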

For databases built with `--window-size`, the relative values are not based on the maximum number of possible shared k-mers but on the actual number of unique minimizers extracted from the read.
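To make the difference concrete, the sketch below counts unique minimizers in a toy read and derives the cutoff threshold from that count. It is a simplification under stated assumptions: `unique_minimizers` is a hypothetical helper that picks the lexicographically smallest forward-strand k-mer per window, which stands in for (and does not reproduce) the hashing ganon actually uses.

```python
import math


def unique_minimizers(seq, kmer_size, window_size):
    """Return the set of minimizers: the smallest k-mer in each window.

    Hypothetical helper using lexicographic order on forward-strand k-mers.
    """
    if window_size < kmer_size:
        raise ValueError("window_size must be >= kmer_size")
    kmers = [seq[i:i + kmer_size] for i in range(len(seq) - kmer_size + 1)]
    kmers_per_window = window_size - kmer_size + 1
    return {
        min(kmers[start:start + kmers_per_window])
        for start in range(len(kmers) - kmers_per_window + 1)
    }


read = "ACGTACGTAC"
minimizers = unique_minimizers(read, kmer_size=3, window_size=5)

# The relative cutoff is applied to the number of unique minimizers,
# not to the maximum number of k-mers in the read.
cutoff_threshold = math.ceil(len(minimizers) * 0.25)
print(sorted(minimizers), cutoff_threshold)
```

When the window size equals the k-mer size, every k-mer is its own minimizer, which is why the earlier example sets `--window-size 19` to keep the arithmetic simple.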

A different `cutoff` can be set for every database in a multiple or hierarchical database classification. A different `filter` can be set for every level of a hierarchical database classification.

Note that reads that remain with only one reference match (after `cutoff` and `filter` are applied) are considered a unique match.