-
Notifications
You must be signed in to change notification settings - Fork 146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User-defined metabolism #1867
User-defined metabolism #1867
Conversation
…ser input sanity checks if there is user input
… orphan data dir for the user if they don't already have one
should probably be merged with the other run_hmmpress function into a more generic one at some point, but KEGG data includes some other steps like deleting intermediate profiles so for now I left them separated
…the KEGG default path
I think it will break things later if we want to use both KEGG and USER data at the same time, so probably the modules db setup needs to take this as a param and not as attribute. but now setup works to make a USER db (as long as you use --just-do-it to avoid KEGG format rules)
…s. this is much more robust
Hey @ivagljiva, Thank you very much for this amazing feature, and for your extensive documentation, which alone is a monumental work for the community. I hope it will get the recognition it deserves over time. I have a few suggestions.
Thank you again! |
Hi @meren! Thank you very much for taking a look!
That sounds like a good idea, and it is an easy fix :) I already changed it.
Nope! At this time, But I am guessing that you want three options: 1) KEGG only, 2) USER only, 3) KEGG + USER. :) |
I think this would be a great way to survey novel metabolisms across large number of environments rapidly and to get output files that require no filtering to get answers for specific questions only involve those novel user metabolic descriptions. How are you planning to do it? The default is the good 'ol default (KEGG), user modules are engaged with the |
Yes, this way :) It was either that, or force people to use both |
… anvi-setup-user-modules
…e, and uses whatever modules database is provided as a parameter
All done, documented and tested :) We now have the option to only run on USER data. |
You are the best! I think we should advertise for it among our colleagues and on Twitter to have some people test it. What do you think? |
(and I think we can merge it into the main branch so if there are bugs to discover, we start that process, too!). |
Agreed! I will merge it and post on the anvi'o slack :) |
This is it. The PR we've all been waiting for (lol). With these changes, anvi'o users can finally estimate metabolism for any arbitrary metabolic pathway that they define themselves.
A new program for setting up user data
The program
anvi-setup-user-modules
takes as input a directory of module files and produces a modules database (USER_MODULES.db
) containing these pathways. For example:There are strict formatting requirements (which are detailed in the help pages for this program and the
user-modules-data
artifact). We are using the same format as KEGG MODULE, except that user module files have the additional requirement that each enzyme in the module DEFINITION must have a corresponding ANNOTATION_SOURCE line indicating which functional annotation source they belong to (ie, 'KOfam', 'COG20_FUNCTION', etc). The program includes some basic sanity checks, but if users fail to format the files properly, it may not be able to catch some issues. I think it needs extensive user testing so that people can find creative ways to break the parsing :)Note that this program requires KEGG data to already be set up on the user's computer, because the user module identifiers are compared against the KEGG module identifiers to make sure there is no overlap. If the KEGG data to be used along with the user modules are located in a non-default directory, their location can be specified using the
--kegg-data-dir
parameter.Support for arbitrary annotation sources and the
--add-to-functions-table
parameter foranvi-run-hmms
There is no single annotation program for user-defined modules. The aforementioned ANNOTATION_SOURCE value allows users to define their metabolic pathways using any functional annotation source they want. They can use enzymes annotated by standard functional annotation programs like
anvi-run-ncbi-cogs
,anvi-run-pfams
, oranvi-run-kegg-kofams
. If they have their own HMM profiles for their enzymes, they can useanvi-run-hmms
- however, this requires the use of the new parameter--add-to-functions-table
, which puts the HMM hits into thegene_functions
table instead of thehmm_hits
table in the database. Finally, if none of these situations apply, the user can import the required functional annotations usinganvi-import-functions
.The critical thing is that the annotation source strings in the
gene_functions
table in the contigs database must match to those in the module file. There are detailed instructions on how to do this in the documentation, and there are sanity checks in place to make sureanvi-estimate-metabolism
will give a clear error if the proper annotation sources are not found.One final note: a minor change was made to
anvi-script-pfam-accessions-to-hmms-directory
so that it produces agenes.txt
file containing the gene accession from the HMM profile itself, to make that file consistent with the annotation that actually get put into the database. This is so people can define metabolic pathways with the accessions from thegenes.txt
file and have them match.Estimating metabolism on user data
To estimate the completeness of user-defined modules, use the
--input-dir
parameter withanvi-estimate-metabolism
and pass the same directory that was given toanvi-setup-user-modules
:The program will estimate the completeness of the pathways in the
USER_MODULES.db
in addition to KEGG modules. The output file names and their headers have been changed to be generic (for instance, thekofam_hits
output is now justhits
output). In addition, the--kegg-output-modes
header is now--output-modes
.A few technical notes on metabolism estimation
ko_list.txt file
. There is no such file for the more generic user data case, so instead the dictionary of enzyme to function definitions (self.ko_dict
) gets enzymes from user-defined annotation sources added to it as follows:get_ko_function_dict()
). These are later supplemented by enzymes annotated in the current contigs database (within theinit_hits_and_splits()
function).get_ko_function_dict()
, then we loop through each contigs db in the input file and add all of their unique functional annotations to the dictionary.MetagenomeDescriptions
class,get_functions_and_sequences_dicts_from_contigs_db()
, which is almost equivalent to the function of the same name in theGenomeDescriptions
class. It works, but note that I haven't tested it extensively.init_data_from_modules_db
function, which loads the modules table(s) into theself.all_modules_in_db
attribute all at once, and also initializes the module paths and the compound lists (using modules DB functions)self.all_modules_in_db
attribute or has its own function in theKeggEstimatorArgs
classTest output
To ensure all my changes did not affect the module completeness scores, I downloaded the most recent version of KEGG on both
master
anduser_modules_db
, annotated the same contigs DB and estimated metabolism on it with each branch. Then I compared the corresponding columns in the resulting output file. The only differences were in the header (as expected, since I changed some column names) and in the gene caller ID order (for modules that had multiple copies of the same KO):So I am confident that the estimation algorithm is still working as expected.
A new completeness estimate and a new warning type
For a while, we've been concerned about the high completeness estimates for modules that share almost all of their enzymes with other modules, because these high scores may be misleading. But now
anvi-estimate-metabolism
is aware of which enzymes are shared and which are unique to a single module - and it uses the unique enzymes to compute a supplemental completion estimate. These are the new values in the 'modules' mode output file:proportion_unique_enzymes_present
column holds the proportion of enzymes unique to the module that are present in the sample (a default header)enzymes_unique_to_module
column holds comma-separated list of the unique enzyme accessions (default)unique_enzymes_hit_counts
column holds comma-separated list of the number of times each unique enzyme appears in the sample (in same order as enzymes_unique_to_module list) (default)unique_enzymes_context_string
column has a string describing how the proportion was calculated (ie "x of y unique enzymes"), to quickly distinguish between modules that have just a few unique enzymes and those that have a lot. This is not a default output header, but can be requested for "modules custom" mode.In addition to these, there is a new type of warning: if an enzyme is shared between multiple modules, it gets a string in the
warning
column that lists which modules it belongs to. For example:K01491 is present in multiple modules: M00377/M00140/M00618
The death of
unique_id
and thedo_not_write_key_column
parameterThe
unique_id
column has been deprecated, because it was not useful and could even be confusing (because in multi-mode, there was a bug in which the unique ID started over for each database in the input file).Since this value was the key of the data dictionary passed to
utils.store_dict_as_TAB_delimited_file()
, I had to add a new parameter to this function:do_not_write_key_column
. This is a Boolean argument. When it isTrue
, the key of the passed dictionary will not be printed as the first column in the resulting output file. This also required some minor changes to theAppendableFile.append()
function which internally callsstore_dict_as_TAB_delimited_file()
, so that it could work with the new parameter.For anyone interested, here is how I tested this change:
with_key.txt
:without_key.txt
:A migration script for modules databases - v2 to v3
Implementing these changes required a new version of the modules database. These are the changes:
data_source
key that can beUSER
orKEGG
annotation_sources
key that lists the annotation sources required for the modules withinmodules
instead ofkegg_modules
There is a migration script that people can use to migrate their old KEGG databases from v2 to v3. It changes the name of the modules table, sets
data_source
toKEGG
, and setsannotation_sources
toKOfam
.Documentation
Finally, the help pages for the relevant programs and artifacts have been created/updated. The
user-modules-data
page could probably use a realistic metabolic pathway example in addition to the toy example that is already there, but hopefully it is already clear. There is also a realistic example on the page foranvi-setup-user-modules
.The Infant Gut Tutorial will be out of date, so once we make the next anvi'o release I will need to fix that page. I could perhaps also add a notice box to the Nif MAG blog post, since those commands will become outdated as well.
That's all!
Thanks for reading this through :) As always, comments and suggestions are welcome!