Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User-defined metabolism #1867

Merged
merged 186 commits into from
Jan 1, 2022
Merged

User-defined metabolism #1867

merged 186 commits into from
Jan 1, 2022

Conversation

ivagljiva
Copy link
Contributor

This is it. The PR we've all been waiting for (lol). With these changes, anvi'o users can finally estimate metabolism for any arbitrary metabolic pathway that they define themselves.

A new program for setting up user data

The program anvi-setup-user-modules takes as input a directory of module files and produces a modules database (USER_MODULES.db) containing these pathways. For example:

anvi-setup-user-modules --input-dir USER_MODULES/

There are strict formatting requirements (which are detailed in the help pages for this program and the user-modules-data artifact). We are using the same format as KEGG MODULE, except that user module files have the additional requirement that each enzyme in the module DEFINITION must have a corresponding ANNOTATION_SOURCE line indicating which functional annotation source they belong to (ie, 'KOfam', 'COG20_FUNCTION', etc). The program includes some basic sanity checks, but if users fail to format the files properly, it may not be able to catch some issues. I think it needs extensive user testing so that people can find creative ways to break the parsing :)

Note that this program requires KEGG data to already be set up on the user's computer, because the user module identifiers are compared against the KEGG module identifiers to make sure there is no overlap. If the KEGG data to be used along with the user modules are located in a non-default directory, their location can be specified using the --kegg-data-dir parameter.

Support for arbitrary annotation sources and the --add-to-functions-table parameter for anvi-run-hmms

There is no single annotation program for user-defined modules. The aforementioned ANNOTATION_SOURCE value allows users to define their metabolic pathways using any functional annotation source they want. They can use enzymes annotated by standard functional annotation programs like anvi-run-ncbi-cogs, anvi-run-pfams, or anvi-run-kegg-kofams. If they have their own HMM profiles for their enzymes, they can use anvi-run-hmms - however, this requires the use of the new parameter --add-to-functions-table, which puts the HMM hits into the gene_functions table instead of the hmm_hits table in the database. Finally, if none of these situations apply, the user can import the required functional annotations using anvi-import-functions.

The critical thing is that the annotation source strings in the gene_functions table in the contigs database must match to those in the module file. There are detailed instructions on how to do this in the documentation, and there are sanity checks in place to make sure anvi-estimate-metabolism will give a clear error if the proper annotation sources are not found.

One final note: a minor change was made to anvi-script-pfam-accessions-to-hmms-directory so that it produces a genes.txt file containing the gene accession from the HMM profile itself, to make that file consistent with the annotation that actually get put into the database. This is so people can define metabolic pathways with the accessions from the genes.txt file and have them match.

Estimating metabolism on user data

To estimate the completeness of user-defined modules, use the --input-dir parameter with anvi-estimate-metabolism and pass the same directory that was given to anvi-setup-user-modules:

anvi-estimate-metabolism -c CONTIGS.db --input-dir USER_MODULES/ 

The program will estimate the completeness of the pathways in the USER_MODULES.db in addition to KEGG modules. The output file names and their headers have been changed to be generic (for instance, the kofam_hits output is now just hits output). In addition, the --kegg-output-modes header is now --output-modes.

A few technical notes on metabolism estimation

  • modules mode output still includes a warning when a KOfam profile is missing and unable to be annotated. However, KOfams are the only annotation source that this happens for - there is no warning for COGs, Pfams, or other sources.
  • parsing of module steps has been updated to be more generic, so it does not work exclusively for KOfams. However, the consequence of this is that the sanity checking can no longer be as robust because we cannot rely on all enzymes having a KOfam identifier and being exactly 6 characters long. Ideally, it should not be a problem, but it does mean that users will need to be careful when crafting their module definitions
  • when we were working only with KEGG data, we pulled KO function definitions from the ko_list.txt file. There is no such file for the more generic user data case, so instead the dictionary of enzyme to function definitions (self.ko_dict) gets enzymes from user-defined annotation sources added to it as follows:
    • enzymes are taken initially from the user modules database (via the new modules db function get_ko_function_dict()). These are later supplemented by enzymes annotated in the current contigs database (within the init_hits_and_splits() function).
    • this is somewhat more complex for multi-mode + matrix format, because we need the list of all functional annotations from all contigs databases (and the modules database). After the enzymes from the modules DB are added to the dictionary with get_ko_function_dict(), then we loop through each contigs db in the input file and add all of their unique functional annotations to the dictionary.
    • this required the addition of a new function in the MetagenomeDescriptions class, get_functions_and_sequences_dicts_from_contigs_db(), which is almost equivalent to the function of the same name in the GenomeDescriptions class. It works, but note that I haven't tested it extensively.
  • to facilitate the use of two modules DBs at once, modules DB access is now limited to the following two places in the code:
    • the initial sanity check for matching DB hashes, and loading of the annotation sources
    • the init_data_from_modules_db function, which loads the modules table(s) into the self.all_modules_in_db attribute all at once, and also initializes the module paths and the compound lists (using modules DB functions)
    • anything else that used to require modules DB access (or a modules DB function) now can be taken directly from the self.all_modules_in_db attribute or has its own function in the KeggEstimatorArgs class
    • consolidating it this way also limits the amount of I/O that is necessary during program execution, which will hopefully result in a performance increase (but this has not been tested)

Test output

To ensure all my changes did not affect the module completeness scores, I downloaded the most recent version of KEGG on both master and user_modules_db, annotated the same contigs DB and estimated metabolism on it with each branch. Then I compared the corresponding columns in the resulting output file. The only differences were in the header (as expected, since I changed some column names) and in the gene caller ID order (for modules that had multiple copies of the same KO):

$ diff <(cut -f 3,9,11,12 master_modules.txt ) <(cut -f 1,8,13,14 new_modules.txt )
1c1
< kegg_module	module_completeness	kofam_hits_in_module	gene_caller_ids_in_module
---
> module	module_completeness	enzyme_hits_in_module	gene_caller_ids_in_module
13c13
< M00005	1.0	K00948,K00948	1523,926
---
> M00005	1.0	K00948,K00948	926,1523
23c23
< M00855	1.0	K00688,K00705,K00705,K01200,K01200,K01835	29,1597,314,1596,724,372
---
> M00855	1.0	K00688,K00705,K00705,K01200,K01200,K01835	29,314,1597,724,1596,372
55c55
< M00086	1.0	K01897,K01897,K01897,K01897	1452,799,740,794
---
> M00086	1.0	K01897,K01897,K01897,K01897	799,740,794,1452

So I am confident that the estimation algorithm is still working as expected.

A new completeness estimate and a new warning type

For a while, we've been concerned about the high completeness estimates for modules that share almost all of their enzymes with other modules, because these high scores may be misleading. But now anvi-estimate-metabolism is aware of which enzymes are shared and which are unique to a single module - and it uses the unique enzymes to compute a supplemental completion estimate. These are the new values in the 'modules' mode output file:

  • proportion_unique_enzymes_present column holds the proportion of enzymes unique to the module that are present in the sample (a default header)
  • enzymes_unique_to_module column holds comma-separated list of the unique enzyme accessions (default)
  • unique_enzymes_hit_counts column holds comma-separated list of the number of times each unique enzyme appears in the sample (in same order as enzymes_unique_to_module list) (default)
  • unique_enzymes_context_string column has a string describing how the proportion was calculated (ie "x of y unique enzymes"), to quickly distinguish between modules that have just a few unique enzymes and those that have a lot. This is not a default output header, but can be requested for "modules custom" mode.

In addition to these, there is a new type of warning: if an enzyme is shared between multiple modules, it gets a string in the warning column that lists which modules it belongs to. For example: K01491 is present in multiple modules: M00377/M00140/M00618

The death of unique_id and the do_not_write_key_column parameter

The unique_id column has been deprecated, because it was not useful and could even be confusing (because in multi-mode, there was a bug in which the unique ID started over for each database in the input file).

Since this value was the key of the data dictionary passed to utils.store_dict_as_TAB_delimited_file(), I had to add a new parameter to this function: do_not_write_key_column. This is a Boolean argument. When it is True, the key of the passed dictionary will not be printed as the first column in the resulting output file. This also required some minor changes to the AppendableFile.append() function which internally calls store_dict_as_TAB_delimited_file(), so that it could work with the new parameter.

For anyone interested, here is how I tested this change:

import anvio
from anvio import utils
d = {0: {"a": 1, "b": 2, "c": 3}, 1: {"a": 4, "b": 5, "c": 6}, 2: {"a": 1, "b": 2, "c": 3}, 3: {"a": 4, "b": 5, "c": 6}}
utils.store_dict_as_TAB_delimited_file(d, "with_key.txt")
utils.store_dict_as_TAB_delimited_file(d, "without_key.txt", do_not_write_key_column=True)
exit()

with_key.txt:

key	a	b	c
0	1	2	3
1	4	5	6
2	1	2	3
3	4	5	6

without_key.txt:

a	b	c
1	2	3
4	5	6
1	2	3
4	5	6

A migration script for modules databases - v2 to v3

Implementing these changes required a new version of the modules database. These are the changes:

  • the self table has a data_source key that can be USER or KEGG
  • the self table has an annotation_sources key that lists the annotation sources required for the modules within
  • the modules table is now called modules instead of kegg_modules

There is a migration script that people can use to migrate their old KEGG databases from v2 to v3. It changes the name of the modules table, sets data_source to KEGG, and sets annotation_sources to KOfam.

Documentation

Finally, the help pages for the relevant programs and artifacts have been created/updated. The user-modules-data page could probably use a realistic metabolic pathway example in addition to the toy example that is already there, but hopefully it is already clear. There is also a realistic example on the page for anvi-setup-user-modules.

The Infant Gut Tutorial will be out of date, so once we make the next anvi'o release I will need to fix that page. I could perhaps also add a notice box to the Nif MAG blog post, since those commands will become outdated as well.

That's all!

Thanks for reading this through :) As always, comments and suggestions are welcome!

…ser input sanity checks if there is user input
… orphan data dir for the user if they don't already have one
should probably be merged with the other run_hmmpress function into a 
more generic one at some point, but KEGG data includes some other steps 
like deleting intermediate profiles so for now I left them separated
I think it will break things later if we want to use both KEGG and USER 
data at the same time, so probably the modules db setup needs to take 
this as a param and not as attribute. 

but now setup works to make a USER db (as long as you use --just-do-it 
to avoid KEGG format rules)
@meren
Copy link
Member

meren commented Dec 31, 2021

Hey @ivagljiva,

Thank you very much for this amazing feature, and for your extensive documentation, which alone is a monumental work for the community. I hope it will get the recognition it deserves over time.

I have a few suggestions.

  • I wonder if --user-modules may have been a better parameter name for --input-dir here:
anvi-estimate-metabolism -c CONTIGS.db --input-dir USER_MODULES/ 
  • This is just pure gold:

There is no single annotation program for user-defined modules. The aforementioned ANNOTATION_SOURCE value allows users to define their metabolic pathways using any functional annotation source they want. They can use enzymes annotated by standard functional annotation programs like anvi-run-ncbi-cogs, anvi-run-pfams, or anvi-run-kegg-kofams. If they have their own HMM profiles for their enzymes, they can use anvi-run-hmms - however, this requires the use of the new parameter --add-to-functions-table, which puts the HMM hits into the gene_functions table instead of the hmm_hits table in the database. Finally, if none of these situations apply, the user can import the required functional annotations using anvi-import-functions.

  • I understand why KEGG data must already be set up on the user's computer for anvi-setup-user-modules to work, but I just wanted to make sure: someone can define their own modules, run them on a given contig-db, and run estimate metabolism without having to run anvi-run-kegg-kofams, right? I'm imagining a scenario where someone has a very metabolism definition and they are only interested in estimating its occurrence across many genomes/metagenomes: they do not have to run any KOfams at all, right?

Thank you again!

@ivagljiva
Copy link
Contributor Author

Hi @meren! Thank you very much for taking a look!

I wonder if --user-modules may have been a better parameter name for --input-dir here:

That sounds like a good idea, and it is an easy fix :) I already changed it.

someone can define their own modules, run them on a given contig-db, and run estimate metabolism without having to run anvi-run-kegg-kofams, right?

Nope! At this time, anvi-estimate-metabolism first estimates on KEGG data, and then estimates on USER data. So the only two options are 1) KEGG only or 2) KEGG + USER, meaning that anvi-run-kegg-kofams is required.

But I am guessing that you want three options: 1) KEGG only, 2) USER only, 3) KEGG + USER. :)
That is a good idea, and I am on it.

@meren
Copy link
Member

meren commented Dec 31, 2021

Nope! At this time, anvi-estimate-metabolism first estimates on KEGG data, and then estimates on USER data. So the only two options are 1) KEGG only or 2) KEGG + USER, meaning that anvi-run-kegg-kofams is required.

But I am guessing that you want three options: 1) KEGG only, 2) USER only, 3) KEGG + USER. :) That is a good idea, and I am on it.

I think this would be a great way to survey novel metabolisms across large number of environments rapidly and to get output files that require no filtering to get answers for specific questions only involve those novel user metabolic descriptions.

How are you planning to do it?

The default is the good 'ol default (KEGG), user modules are engaged with the --user-modules parameter that points to the user modules db (KEGG + USER), and perhaps the user-only mode can be specified with a --only-user-modules flag?

@ivagljiva
Copy link
Contributor Author

The default is the good 'ol default (KEGG), user modules are engaged with the --user-modules parameter that points to the user modules db (KEGG + USER), and perhaps the user-only mode can be specified with a --only-user-modules flag?

Yes, this way :) It was either that, or force people to use both --user-modules and --kegg-data-dir when they want KEGG + USER, but the latter is less intuitive and also less easy to implement given the current setup.

@ivagljiva
Copy link
Contributor Author

All done, documented and tested :) We now have the option to only run on USER data.

@meren
Copy link
Member

meren commented Dec 31, 2021

You are the best! I think we should advertise for it among our colleagues and on Twitter to have some people test it. What do you think?

@meren
Copy link
Member

meren commented Dec 31, 2021

(and I think we can merge it into the main branch so if there are bugs to discover, we start that process, too!).

@ivagljiva
Copy link
Contributor Author

I think we should advertise for it among our colleagues and on Twitter to have some people test it. What do you think?

Agreed! I will merge it and post on the anvi'o slack :)

@ivagljiva ivagljiva merged commit f5f9893 into master Jan 1, 2022
@ivagljiva ivagljiva deleted the user_modules_db branch March 22, 2023 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants