Modifying the host genome for Kraken2 #113

taltman · 2020-06-04T21:16:30Z

Hi there,

A hearty hello from the Serratus project!

We are keen to work with you guys to integrate viralrecon into our effort to isolate as many distantly-related Coronaviruses as possible.

I'm the resident Kraken2 enthusiast there. Three questions:

1. What is the "base" Kraken2 DB used by default? Which genomes does it contain?
2. How do you provide alternate host genomes to Kraken2 in viralrecon? The docs weren't clear.
3. Is it easy to configure the pipeline to save the raw Kraken2 output files?

Thanks!

drpatelh · 2020-06-05T07:39:12Z

Hi @taltman! Hope you and your loved ones are keeping well. Thanks for getting in touch. Yep, @ababaian and I have introduced ourselves on the #viralrecon channel on nf-core Slack. Serratus looks really cool, and we look forward to helping in any way we can with the downstream analysis!

The default Kraken2 database contains only the human genome. It will be downloaded from here and used by default unless you override the --kraken2_db parameter.

At present, the implementation to build the Kraken 2 database is quite simple. You can provide your own --kraken2_db_name to the pipeline (default: 'human'), but the value has to be one that Kraken 2 recognises, because it only supports a fixed set of out-of-the-box library keys. See point 2. in the docs.

I have intentionally avoided parameterising all of the options that can be used to build the Kraken 2 database because of the issues I have seen in the past when downloading and creating it. Ideally, I would prefer that the database is built and monitored outside of the pipeline, rather than risking silent failures where some files were not downloaded properly and yet you still end up with a database you can use. Having said that, I was thinking of adding the parameters below to the pipeline to make things more flexible, but didn't get around to it:

--kraken2_build_lib [str] Comma-separated list of Kraken2 recognised library names to build host database (Default: 'human')
--kraken2_build_fasta [file] Comma-separated list of fasta files used to build Kraken2 host database (Default: '')

To summarise, you can build the database however you choose, mixing custom fasta files with the standard libraries, and provide it to the pipeline with --kraken2_db. Any reads that are not classified against the Kraken 2 database are passed on to the assembly processes, which means that, in theory, if the database contains multiple hosts then reads from all of those hosts will be filtered out.
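As a rough illustration of building such a database outside the pipeline, the sketch below assembles the kraken2-build commands in the order they would be run. The function name kraken2_build_plan, the directory kraken2_host_db and the file my_host.fa are hypothetical; the library names are assumed to be ones that kraken2-build recognises, and any custom fasta file is assumed to already carry taxonomy IDs in its headers.

```python
import shlex
from typing import List, Sequence


def kraken2_build_plan(
    db_dir: str,
    libraries: Sequence[str] = ("human",),
    fasta_files: Sequence[str] = (),
    threads: int = 4,
) -> List[List[str]]:
    """Return the kraken2-build commands for a host database, in run order."""
    # Taxonomy must be present before libraries are added and the DB is built.
    plan = [["kraken2-build", "--download-taxonomy", "--db", db_dir]]
    # Standard libraries that kraken2-build knows how to download by name.
    for name in libraries:
        plan.append(["kraken2-build", "--download-library", name, "--db", db_dir])
    # Custom genomes supplied as local fasta files.
    for path in fasta_files:
        plan.append(["kraken2-build", "--add-to-library", path, "--db", db_dir])
    # Final step: build the database from everything added above.
    plan.append(["kraken2-build", "--build", "--threads", str(threads), "--db", db_dir])
    return plan


if __name__ == "__main__":
    # Hypothetical database directory and fasta file, for illustration only.
    for cmd in kraken2_build_plan("kraken2_host_db", ["human", "plant"], ["my_host.fa"]):
        print(shlex.join(cmd))
```

Running each command with subprocess.run(cmd, check=True) would stop at the first failed download instead of silently producing an incomplete database. The resulting directory is what would then be passed to the pipeline with --kraken2_db.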

More than happy to listen to ideas and for any contributions that you may have 🙂

drpatelh · 2020-06-05T07:43:28Z

Is it easy to configure the pipeline to save the raw Kraken2 output files?

The current outputs from the Kraken 2 process in the pipeline are listed here. You can save the raw FASTQ files with --save_kraken2_fastq, but this isn't done by default, to be storage friendly. Were there any other output files you were interested in?

drpatelh · 2021-04-27T12:53:12Z

Given the issues with creating Kraken2 databases when downloading files, I think we should encourage users either to use the default human one shipped with the pipeline (now hosted on AWS for more consistent and better download speeds) or to build their own and provide it to the pipeline with --kraken2_db. We will keep the process to build the Kraken2 database quite simple until it becomes easier to create it routinely.

poursalavati · 2022-02-10T00:43:20Z

Hello, and thanks for your support.
I have almost the same two questions!

Is there any robust way to prepare the Kraken2 database for other hosts? As we are working on plants, I thought we might download the human database and then add our plant genome to it?

Also, if the Kraken2 database here is only used to filter host reads, why not use other tools that have easier ways to prepare indexes, such as bwa?

Cheers,
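On adding a plant genome to an existing library: Kraken 2 generally needs a taxonomy ID for every sequence it is given, and for genomes not downloaded from NCBI this can be supplied by putting kraken:taxid|<ID> in the sequence ID of each fasta header. The following is a minimal sketch of that tagging step, not something the pipeline does itself. The function name tag_fasta_with_taxid and the example records are hypothetical, and the code assumes the sequence ID is separated from the description by a space.

```python
from typing import Iterable, Iterator


def tag_fasta_with_taxid(lines: Iterable[str], taxid: int) -> Iterator[str]:
    """Yield fasta lines with '|kraken:taxid|<taxid>' appended to each sequence ID."""
    suffix = f"|kraken:taxid|{taxid}"
    for line in lines:
        line = line.rstrip("\n")
        # Leave sequence lines and already-tagged headers untouched.
        if line.startswith(">") and "kraken:taxid" not in line:
            # Only the first space-delimited token is treated as the sequence ID.
            seq_id, _, description = line.partition(" ")
            line = seq_id + suffix + (" " + description if description else "")
        yield line


if __name__ == "__main__":
    # Hypothetical records; 3702 is the NCBI taxonomy ID for Arabidopsis thaliana.
    example = [">chr1 Arabidopsis thaliana chromosome 1", "ACGTACGT", ">chr2", "TTGA"]
    print("\n".join(tag_fasta_with_taxid(example, 3702)))
```

The tagged file could then be added with kraken2-build --add-to-library before the final --build step, and the finished database supplied to the pipeline with --kraken2_db.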

drpatelh added the enhancement (Improvement for existing functionality) and documentation (Improvements or additions to documentation) labels on Jun 5, 2020

drpatelh closed this as completed on Apr 27, 2021

tetedange13 mentioned this issue on Jan 12, 2023 in "Choose Kraken2-filtered host" #350 (Open)
