User-defined metabolism #1867

ivagljiva · 2021-12-26T23:25:16Z

This is it. The PR we've all been waiting for (lol). With these changes, anvi'o users can finally estimate metabolism for any arbitrary metabolic pathway that they define themselves.

A new program for setting up user data

The program anvi-setup-user-modules takes as input a directory of module files and produces a modules database (USER_MODULES.db) containing these pathways. For example:

anvi-setup-user-modules --input-dir USER_MODULES/

There are strict formatting requirements (which are detailed in the help pages for this program and the user-modules-data artifact). We are using the same format as KEGG MODULE, except that user module files have the additional requirement that each enzyme in the module DEFINITION must have a corresponding ANNOTATION_SOURCE line indicating which functional annotation source they belong to (ie, 'KOfam', 'COG20_FUNCTION', etc). The program includes some basic sanity checks, but if users fail to format the files properly, it may not be able to catch some issues. I think it needs extensive user testing so that people can find creative ways to break the parsing :)

Note that this program requires KEGG data to already be set up on the user's computer, because the user module identifiers are compared against the KEGG module identifiers to make sure there is no overlap. If the KEGG data to be used along with the user modules are located in a non-default directory, their location can be specified using the --kegg-data-dir parameter.

Support for arbitrary annotation sources and the `--add-to-functions-table` parameter for `anvi-run-hmms`

There is no single annotation program for user-defined modules. The aforementioned ANNOTATION_SOURCE value allows users to define their metabolic pathways using any functional annotation source they want. They can use enzymes annotated by standard functional annotation programs like anvi-run-ncbi-cogs, anvi-run-pfams, or anvi-run-kegg-kofams. If they have their own HMM profiles for their enzymes, they can use anvi-run-hmms - however, this requires the use of the new parameter --add-to-functions-table, which puts the HMM hits into the gene_functions table instead of the hmm_hits table in the database. Finally, if none of these situations apply, the user can import the required functional annotations using anvi-import-functions.

The critical thing is that the annotation source strings in the gene_functions table in the contigs database must match to those in the module file. There are detailed instructions on how to do this in the documentation, and there are sanity checks in place to make sure anvi-estimate-metabolism will give a clear error if the proper annotation sources are not found.

One final note: a minor change was made to anvi-script-pfam-accessions-to-hmms-directory so that it produces a genes.txt file containing the gene accession from the HMM profile itself, to make that file consistent with the annotation that actually get put into the database. This is so people can define metabolic pathways with the accessions from the genes.txt file and have them match.

Estimating metabolism on user data

To estimate the completeness of user-defined modules, use the --input-dir parameter with anvi-estimate-metabolism and pass the same directory that was given to anvi-setup-user-modules:

anvi-estimate-metabolism -c CONTIGS.db --input-dir USER_MODULES/

The program will estimate the completeness of the pathways in the USER_MODULES.db in addition to KEGG modules. The output file names and their headers have been changed to be generic (for instance, the kofam_hits output is now just hits output). In addition, the --kegg-output-modes header is now --output-modes.

A few technical notes on metabolism estimation

modules mode output still includes a warning when a KOfam profile is missing and unable to be annotated. However, KOfams are the only annotation source that this happens for - there is no warning for COGs, Pfams, or other sources.
parsing of module steps has been updated to be more generic, so it does not work exclusively for KOfams. However, the consequence of this is that the sanity checking can no longer be as robust because we cannot rely on all enzymes having a KOfam identifier and being exactly 6 characters long. Ideally, it should not be a problem, but it does mean that users will need to be careful when crafting their module definitions
when we were working only with KEGG data, we pulled KO function definitions from the ko_list.txt file. There is no such file for the more generic user data case, so instead the dictionary of enzyme to function definitions (self.ko_dict) gets enzymes from user-defined annotation sources added to it as follows:
- enzymes are taken initially from the user modules database (via the new modules db function get_ko_function_dict()). These are later supplemented by enzymes annotated in the current contigs database (within the init_hits_and_splits() function).
- this is somewhat more complex for multi-mode + matrix format, because we need the list of all functional annotations from all contigs databases (and the modules database). After the enzymes from the modules DB are added to the dictionary with get_ko_function_dict(), then we loop through each contigs db in the input file and add all of their unique functional annotations to the dictionary.
- this required the addition of a new function in the MetagenomeDescriptions class, get_functions_and_sequences_dicts_from_contigs_db(), which is almost equivalent to the function of the same name in the GenomeDescriptions class. It works, but note that I haven't tested it extensively.
to facilitate the use of two modules DBs at once, modules DB access is now limited to the following two places in the code:
- the initial sanity check for matching DB hashes, and loading of the annotation sources
- the init_data_from_modules_db function, which loads the modules table(s) into the self.all_modules_in_db attribute all at once, and also initializes the module paths and the compound lists (using modules DB functions)
- anything else that used to require modules DB access (or a modules DB function) now can be taken directly from the self.all_modules_in_db attribute or has its own function in the KeggEstimatorArgs class
- consolidating it this way also limits the amount of I/O that is necessary during program execution, which will hopefully result in a performance increase (but this has not been tested)

Test output

To ensure all my changes did not affect the module completeness scores, I downloaded the most recent version of KEGG on both master and user_modules_db, annotated the same contigs DB and estimated metabolism on it with each branch. Then I compared the corresponding columns in the resulting output file. The only differences were in the header (as expected, since I changed some column names) and in the gene caller ID order (for modules that had multiple copies of the same KO):

$ diff <(cut -f 3,9,11,12 master_modules.txt ) <(cut -f 1,8,13,14 new_modules.txt )
1c1
< kegg_module	module_completeness	kofam_hits_in_module	gene_caller_ids_in_module
---
> module	module_completeness	enzyme_hits_in_module	gene_caller_ids_in_module
13c13
< M00005	1.0	K00948,K00948	1523,926
---
> M00005	1.0	K00948,K00948	926,1523
23c23
< M00855	1.0	K00688,K00705,K00705,K01200,K01200,K01835	29,1597,314,1596,724,372
---
> M00855	1.0	K00688,K00705,K00705,K01200,K01200,K01835	29,314,1597,724,1596,372
55c55
< M00086	1.0	K01897,K01897,K01897,K01897	1452,799,740,794
---
> M00086	1.0	K01897,K01897,K01897,K01897	799,740,794,1452

So I am confident that the estimation algorithm is still working as expected.

A new completeness estimate and a new warning type

For a while, we've been concerned about the high completeness estimates for modules that share almost all of their enzymes with other modules, because these high scores may be misleading. But now anvi-estimate-metabolism is aware of which enzymes are shared and which are unique to a single module - and it uses the unique enzymes to compute a supplemental completion estimate. These are the new values in the 'modules' mode output file:

proportion_unique_enzymes_present column holds the proportion of enzymes unique to the module that are present in the sample (a default header)
enzymes_unique_to_module column holds comma-separated list of the unique enzyme accessions (default)
unique_enzymes_hit_counts column holds comma-separated list of the number of times each unique enzyme appears in the sample (in same order as enzymes_unique_to_module list) (default)
unique_enzymes_context_string column has a string describing how the proportion was calculated (ie "x of y unique enzymes"), to quickly distinguish between modules that have just a few unique enzymes and those that have a lot. This is not a default output header, but can be requested for "modules custom" mode.

In addition to these, there is a new type of warning: if an enzyme is shared between multiple modules, it gets a string in the warning column that lists which modules it belongs to. For example: K01491 is present in multiple modules: M00377/M00140/M00618

The death of `unique_id` and the `do_not_write_key_column` parameter

The unique_id column has been deprecated, because it was not useful and could even be confusing (because in multi-mode, there was a bug in which the unique ID started over for each database in the input file).

Since this value was the key of the data dictionary passed to utils.store_dict_as_TAB_delimited_file(), I had to add a new parameter to this function: do_not_write_key_column. This is a Boolean argument. When it is True, the key of the passed dictionary will not be printed as the first column in the resulting output file. This also required some minor changes to the AppendableFile.append() function which internally calls store_dict_as_TAB_delimited_file(), so that it could work with the new parameter.

For anyone interested, here is how I tested this change:

import anvio
from anvio import utils
d = {0: {"a": 1, "b": 2, "c": 3}, 1: {"a": 4, "b": 5, "c": 6}, 2: {"a": 1, "b": 2, "c": 3}, 3: {"a": 4, "b": 5, "c": 6}}
utils.store_dict_as_TAB_delimited_file(d, "with_key.txt")
utils.store_dict_as_TAB_delimited_file(d, "without_key.txt", do_not_write_key_column=True)
exit()

with_key.txt:

without_key.txt:

A migration script for modules databases - v2 to v3

Implementing these changes required a new version of the modules database. These are the changes:

the self table has a data_source key that can be USER or KEGG
the self table has an annotation_sources key that lists the annotation sources required for the modules within
the modules table is now called modules instead of kegg_modules

There is a migration script that people can use to migrate their old KEGG databases from v2 to v3. It changes the name of the modules table, sets data_source to KEGG, and sets annotation_sources to KOfam.

Documentation

Finally, the help pages for the relevant programs and artifacts have been created/updated. The user-modules-data page could probably use a realistic metabolic pathway example in addition to the toy example that is already there, but hopefully it is already clear. There is also a realistic example on the page for anvi-setup-user-modules.

The Infant Gut Tutorial will be out of date, so once we make the next anvi'o release I will need to fix that page. I could perhaps also add a notice box to the Nif MAG blog post, since those commands will become outdated as well.

That's all!

Thanks for reading this through :) As always, comments and suggestions are welcome!

…ser input sanity checks if there is user input

… modules folder

… orphan data dir for the user if they don't already have one

should probably be merged with the other run_hmmpress function into a more generic one at some point, but KEGG data includes some other steps like deleting intermediate profiles so for now I left them separated

…the KEGG default path

I think it will break things later if we want to use both KEGG and USER data at the same time, so probably the modules db setup needs to take this as a param and not as attribute. but now setup works to make a USER db (as long as you use --just-do-it to avoid KEGG format rules)

…s. this is much more robust

meren · 2021-12-31T13:39:52Z

Hey @ivagljiva,

Thank you very much for this amazing feature, and for your extensive documentation, which alone is a monumental work for the community. I hope it will get the recognition it deserves over time.

I have a few suggestions.

I wonder if --user-modules may have been a better parameter name for --input-dir here:

anvi-estimate-metabolism -c CONTIGS.db --input-dir USER_MODULES/

This is just pure gold:

There is no single annotation program for user-defined modules. The aforementioned ANNOTATION_SOURCE value allows users to define their metabolic pathways using any functional annotation source they want. They can use enzymes annotated by standard functional annotation programs like anvi-run-ncbi-cogs, anvi-run-pfams, or anvi-run-kegg-kofams. If they have their own HMM profiles for their enzymes, they can use anvi-run-hmms - however, this requires the use of the new parameter --add-to-functions-table, which puts the HMM hits into the gene_functions table instead of the hmm_hits table in the database. Finally, if none of these situations apply, the user can import the required functional annotations using anvi-import-functions.

I understand why KEGG data must already be set up on the user's computer for anvi-setup-user-modules to work, but I just wanted to make sure: someone can define their own modules, run them on a given contig-db, and run estimate metabolism without having to run anvi-run-kegg-kofams, right? I'm imagining a scenario where someone has a very metabolism definition and they are only interested in estimating its occurrence across many genomes/metagenomes: they do not have to run any KOfams at all, right?

Thank you again!

ivagljiva · 2021-12-31T14:08:42Z

Hi @meren! Thank you very much for taking a look!

I wonder if --user-modules may have been a better parameter name for --input-dir here:

That sounds like a good idea, and it is an easy fix :) I already changed it.

someone can define their own modules, run them on a given contig-db, and run estimate metabolism without having to run anvi-run-kegg-kofams, right?

Nope! At this time, anvi-estimate-metabolism first estimates on KEGG data, and then estimates on USER data. So the only two options are 1) KEGG only or 2) KEGG + USER, meaning that anvi-run-kegg-kofams is required.

But I am guessing that you want three options: 1) KEGG only, 2) USER only, 3) KEGG + USER. :)
That is a good idea, and I am on it.

meren · 2021-12-31T15:16:58Z

Nope! At this time, anvi-estimate-metabolism first estimates on KEGG data, and then estimates on USER data. So the only two options are 1) KEGG only or 2) KEGG + USER, meaning that anvi-run-kegg-kofams is required.

But I am guessing that you want three options: 1) KEGG only, 2) USER only, 3) KEGG + USER. :) That is a good idea, and I am on it.

I think this would be a great way to survey novel metabolisms across large number of environments rapidly and to get output files that require no filtering to get answers for specific questions only involve those novel user metabolic descriptions.

How are you planning to do it?

The default is the good 'ol default (KEGG), user modules are engaged with the --user-modules parameter that points to the user modules db (KEGG + USER), and perhaps the user-only mode can be specified with a --only-user-modules flag?

ivagljiva · 2021-12-31T15:29:11Z

The default is the good 'ol default (KEGG), user modules are engaged with the --user-modules parameter that points to the user modules db (KEGG + USER), and perhaps the user-only mode can be specified with a --only-user-modules flag?

Yes, this way :) It was either that, or force people to use both --user-modules and --kegg-data-dir when they want KEGG + USER, but the latter is less intuitive and also less easy to implement given the current setup.

… anvi-setup-user-modules

…h same purpose

…nly-user-modules

…nly one, or both

…e, and uses whatever modules database is provided as a parameter

…ded separately

ivagljiva · 2021-12-31T19:03:41Z

All done, documented and tested :) We now have the option to only run on USER data.

meren · 2021-12-31T23:49:55Z

You are the best! I think we should advertise for it among our colleagues and on Twitter to have some people test it. What do you think?

meren · 2021-12-31T23:50:25Z

(and I think we can merge it into the main branch so if there are bugs to discover, we start that process, too!).

ivagljiva · 2022-01-01T00:39:36Z

I think we should advertise for it among our colleagues and on Twitter to have some people test it. What do you think?

Agreed! I will merge it and post on the anvi'o slack :)

ivagljiva added 30 commits May 8, 2021 09:53

more generic name for modules db

6b51682

more generic name for modules table

bb0280a

increase modules db version

584a3c2

adjust params to allow user input directory

b536b56

user input param and sanity checks

e0dc969

attributes for user data paths

7fbdd99

sanity check function for existing user data

6777c65

only do kegg sanity checks if there is no user input folder, and do u…

83caa74

…ser input sanity checks if there is user input

we will require hmm profiles in HMMs folder and module definitions in…

f115a36

… modules folder

sanity check for existing user folder and subfolders

347dbd0

using --reset will delete previously generated files. also we make an…

624c6cb

… orphan data dir for the user if they don't already have one

fixy fixes

86bc4bd

skeleton function for user setup

9f39417

quotes

ddd81f8

sanity check for empty data folders

3b8cd29

function to hmmpress user profiles.

dc30ac7

should probably be merged with the other run_hmmpress function into a more generic one at some point, but KEGG data includes some other steps like deleting intermediate profiles so for now I left them separated

remove all profile indices too with --reset

0cb988b

function to make a modules dict for user mods

53635aa

now modules setup function takes the db path instead of always using …

626c8bb

…the KEGG default path

now we pass in the module file location as a parameter to the db clas…

5d08a69

…s. this is much more robust

data source attribute in modules db self table

2cf56d9

more generic name for modules table

e20c692

scrub references to KEGG from the modules db output

6dfa256

only do KEGG sanity checks when data source is KEGG

44c28b5

only print parsing errors when there are some

4868301

Merge branch 'master' into user_modules_db

bc8fc93

Merge branch 'master' into user_modules_db

bc70d72

Merge branch 'master' into user_modules_db

2b12ec6

Merge branch 'master' into user_modules_db

0f87a32

ivagljiva added the metabolism / modules database label Dec 26, 2021

--input-dir -> --user-modules

f03185d

ivagljiva added 18 commits December 31, 2021 10:16

make sure this sanity check only runs for anvi-setup-kegg-kofams, not…

a4d1885

… anvi-setup-user-modules

add --only-user-modules param

25e21c2

this looks better without parentheses

1bb83e3

sanity check: no --only-user-modules without user modules

92bbc40

QoL: screamy capitals in comments that delineate sections of code wit…

5482e33

…h same purpose

this needs to be an empty dictionary so we can add to it later if --o…

a67f0d5

…nly-user-modules

separate kegg data loading from user data loading so that we can do o…

02d8026

…nly one, or both

output metabolism data

0ae4c7f

convert this function into one that works on just one module at a tim…

b1b022a

…e, and uses whatever modules database is provided as a parameter

some reorganization so that information from each database can be loa…

126e80c

…ded separately

fixy

639f964

apply same db loading logic for JSON data case

c537b44

make sure it works for multi mode too

c39af77

multi mode needs this sanity check too

407c279

sanity check against --kegg-data-dir + --only-user-modules

57e17b9

clarify possibilities in anvi-estimate-metabolism doc

43e7f20

update situation in user-modules-data page

8089258

actually these need to be . for the list to render properly

ad27cc4

ivagljiva merged commit f5f9893 into master Jan 1, 2022

ivagljiva deleted the user_modules_db branch March 22, 2023 10:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

User-defined metabolism #1867

User-defined metabolism #1867

ivagljiva commented Dec 26, 2021

meren commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

meren commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

meren commented Dec 31, 2021

meren commented Dec 31, 2021

ivagljiva commented Jan 1, 2022

User-defined metabolism #1867

User-defined metabolism #1867

Conversation

ivagljiva commented Dec 26, 2021

A new program for setting up user data

Support for arbitrary annotation sources and the --add-to-functions-table parameter for anvi-run-hmms

Estimating metabolism on user data

A few technical notes on metabolism estimation

Test output

A new completeness estimate and a new warning type

The death of unique_id and the do_not_write_key_column parameter

A migration script for modules databases - v2 to v3

Documentation

That's all!

meren commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

meren commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

ivagljiva commented Dec 31, 2021

meren commented Dec 31, 2021

meren commented Dec 31, 2021

ivagljiva commented Jan 1, 2022

Support for arbitrary annotation sources and the `--add-to-functions-table` parameter for `anvi-run-hmms`

The death of `unique_id` and the `do_not_write_key_column` parameter