Modifying the host genome for Kraken2 #113

Closed
taltman opened this issue Jun 4, 2020 · 4 comments
Labels
documentation Improvements or additions to documentation enhancement Improvement for existing functionality

Comments

@taltman

taltman commented Jun 4, 2020

Hi there,

A hearty hello from the Serratus project!

https://github.com/ababaian/serratus/

We are keen to work with you guys to integrate viralrecon into our effort to isolate as many distantly-related Coronaviruses as possible.

I'm the resident Kraken2 enthusiast there. Three questions:

  • What is the "base" Kraken2 DB used by default? Which genomes?
  • How do you provide alternate host genomes to Kraken2 in viralrecon? The docs weren't clear.
  • Is it easy to configure the pipeline to save the raw Kraken2 output files?

Thanks!

@drpatelh drpatelh added enhancement Improvement for existing functionality documentation Improvements or additions to documentation labels Jun 5, 2020
@drpatelh
Member

drpatelh commented Jun 5, 2020

Hi @taltman! Hope you and loved ones are keeping well. Thanks for getting in touch. Yep, @ababaian and I have introduced ourselves on the #viralrecon channel on nf-core Slack. Serratus looks really cool and we look forward to helping in any way we can to get involved with the downstream analysis!

The default Kraken2 database just contains the human genome. This will be downloaded from here and used by default unless you overwrite the --kraken2_db parameter.

At present, the implementation to build the Kraken 2 database is quite simple. You can provide your own --kraken2_db_name to the pipeline (default: 'human'), but the value has to be a library name that Kraken 2 itself recognises, because it only supports a fixed set of out-of-the-box keys. See point 2. in the docs.

I have intentionally avoided trying to parameterise all of the options that can be used to build the Kraken 2 database because of the issues I have seen in the past when downloading and creating the database. Ideally, I would prefer that this is monitored and built outside of the pipeline, as opposed to having silent failures where some files were not downloaded properly and yet you still end up with a database you can use. Having said that, I was thinking of adding the parameters below to the pipeline to make things more flexible but didn't get around to it:

--kraken2_build_lib [str] Comma-separated list of Kraken2 recognised library names to build host database (Default: 'human')
--kraken2_build_fasta [file] Comma-separated list of fasta files used to build Kraken2 host database (Default: '')

To summarise, you can build the database however you choose, mixing custom fasta files with the standard libraries, and provide it to the pipeline with --kraken2_db. Any read that is not classified against the Kraken 2 database is passed on to the assembly processes, which means that, in theory, if the database contains multiple hosts then reads from all of those hosts will be filtered out.
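As a rough illustration of building the database outside the pipeline, the sketch below only assembles the `kraken2-build` commands in order (taxonomy, standard libraries, custom fasta files, final build) and prints them for review rather than running them. The function name, the database directory `kraken2_host_db` and the file `plant_host.fa` are placeholders made up for this example, and the thread count is an arbitrary assumption.

```python
import shlex
from typing import List, Sequence


def kraken2_build_commands(
    db_dir: str,
    libraries: Sequence[str] = ("human",),
    fasta_files: Sequence[str] = (),
    threads: int = 4,
) -> List[List[str]]:
    """Return the kraken2-build invocations for a host database, in order.

    Nothing is executed here; the caller decides how to run and monitor
    each step, so a failed download does not go unnoticed.
    """
    # The taxonomy is needed before any library can be built.
    commands = [["kraken2-build", "--download-taxonomy", "--db", db_dir]]
    # Standard libraries that Kraken 2 knows how to download by name.
    for library in libraries:
        commands.append(
            ["kraken2-build", "--download-library", library, "--db", db_dir]
        )
    # Custom host genomes supplied as fasta files.
    for fasta in fasta_files:
        commands.append(
            ["kraken2-build", "--add-to-library", fasta, "--db", db_dir]
        )
    # Final step: build the database from everything collected above.
    commands.append(
        ["kraken2-build", "--build", "--threads", str(threads), "--db", db_dir]
    )
    return commands


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    for cmd in kraken2_build_commands("kraken2_host_db", ["human"], ["plant_host.fa"]):
        print(shlex.join(cmd))
```

Running each printed command by hand (or from a script that checks exit codes) keeps the download and build steps visible, and the resulting directory is what would then be passed to --kraken2_db.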

More than happy to listen to ideas and for any contributions that you may have 🙂

@drpatelh
Member

drpatelh commented Jun 5, 2020

Is it easy to configure the pipeline to save the raw Kraken2 output files?

The current outputs from the Kraken 2 process in the pipeline are listed here. You can save the raw fastq files with the --save_kraken2_fastq parameter, but this isn't done by default in order to be storage friendly. Were there any other output files you were interested in?
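For illustration, the two options mentioned in this thread (--kraken2_db and --save_kraken2_fastq) could be combined on one command line along the lines of the sketch below. It only assembles and prints the command. The helper name and the database path are hypothetical, and the other arguments a real run needs (input, profile and so on) are deliberately left to the caller via `extra_args`.

```python
import shlex
from typing import List, Sequence


def viralrecon_command(
    kraken2_db: str,
    save_kraken2_fastq: bool = False,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble a viralrecon command line using a custom Kraken 2 database."""
    cmd = ["nextflow", "run", "nf-core/viralrecon", "--kraken2_db", kraken2_db]
    if save_kraken2_fastq:
        # Off by default in the pipeline to save storage.
        cmd.append("--save_kraken2_fastq")
    # Remaining pipeline arguments are passed through untouched.
    cmd.extend(extra_args)
    return cmd


if __name__ == "__main__":
    # "kraken2_host_db" is a placeholder path, not a real database.
    print(shlex.join(viralrecon_command("kraken2_host_db", save_kraken2_fastq=True)))
```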

@drpatelh
Member

Given the issues with creating Kraken2 databases when downloading files, I think we should encourage users either to use the default human one shipped with the pipeline (now hosted on AWS for more consistent and better download speeds) or to build their own and provide it to the pipeline with --kraken2_db. We will keep the process to build the Kraken2 database quite simple until it becomes easier to routinely create it.

@poursalavati

Hello and thanks for your support.
I have almost the same two questions!

Is there any robust way to prepare the kraken2 database for other hosts? As we are working on plants, I thought we might download the human database and then add our plant genome to it!?

Also, if the kraken2 database here is only used to filter host reads, why not use other tools that have easier ways to prepare indexes, such as bwa?

Cheers,
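
One practical detail when adding a custom host genome: Kraken 2 needs to map each added sequence to a taxonomy ID, either through an NCBI accession in the header or through an explicit `kraken:taxid|XXX` tag on the sequence ID. The sketch below is one possible way to add that tag to fasta headers before using `kraken2-build --add-to-library`. The function name is made up for this example, and taxid 3702 (Arabidopsis thaliana) is only a stand-in for whatever host is actually being used.

```python
from typing import Iterable, Iterator


def tag_fasta_headers(lines: Iterable[str], taxid: int) -> Iterator[str]:
    """Yield fasta lines (without newlines) with a kraken:taxid tag on each header.

    The tag is appended to the sequence ID (the first whitespace-delimited
    token), and any description after it is preserved. Headers that already
    carry a kraken:taxid tag are left unchanged.
    """
    suffix = f"|kraken:taxid|{taxid}"
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            # Sequence lines pass through untouched.
            yield line
            continue
        fields = line.split(None, 1)
        seq_id = fields[0]
        if "kraken:taxid|" not in seq_id:
            seq_id += suffix
        if len(fields) > 1:
            yield f"{seq_id} {fields[1]}"
        else:
            yield seq_id


if __name__ == "__main__":
    # Tiny in-memory example; 3702 is an illustrative taxid only.
    example = [">chr1 Arabidopsis chromosome 1\n", "ACGTACGT\n"]
    for out_line in tag_fasta_headers(example, 3702):
        print(out_line)
```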
