Merge pull request #2121 from merenlab/kegg_download_consolidation
Consolidating our KEGG data download
semiller10 committed Sep 25, 2023
2 parents 9b88802 + fb10435 commit 66d6493
Showing 26 changed files with 1,175 additions and 770 deletions.
1 change: 1 addition & 0 deletions .conda/environment.yaml
@@ -31,3 +31,4 @@ dependencies:
 - r-magrittr
 - bioconductor-qvalue
 - fastani
+- meme
5 changes: 4 additions & 1 deletion .github/workflows/daily-component-tests-and-migrations.yaml
@@ -37,10 +37,13 @@ jobs:
 anvi-self-test --suite metagenomics-full --no-interactive
 anvi-self-test --suite pangenomics --no-interactive
 anvi-self-test --suite inversions --no-interactive
-anvi-self-test --suite metabolism --no-interactive
 # the following steps cause our actions to fail on GitHub runners
 # due to space limitations :/ please do not uncomment this until we
 # have a solution for this :/
+#- name: "Run component tests for metabolism framework"
+# shell: bash -l {0}
+# run: |
+# anvi-self-test --suite metabolism --no-interactive
 #- name: "Migrate ancient anvi'o databases"
 # shell: bash -l {0}
 # run: |
2 changes: 1 addition & 1 deletion Dockerfiles/anvio-structure/Dockerfile
@@ -72,7 +72,7 @@ RUN rm anvio-7.1.tar.gz
 # Setup anvi'o databases
 ##############################################################
 RUN anvi-setup-interacdome
-RUN anvi-setup-kegg-kofams --kegg-snapshot v2020-12-23
+RUN anvi-setup-kegg-data --kegg-snapshot v2020-12-23
 RUN anvi-setup-pfams --pfam-version 33.1
 RUN anvi-setup-ncbi-cogs --cog-version COG20

28 changes: 14 additions & 14 deletions anvio/__init__.py
@@ -1044,37 +1044,37 @@ def TABULATE(table, header, numalign="right", max_width=0):
 "you will not have the most up-to-date version of KEGG for your annotations, metabolism "
 "estimations, or any other downstream uses of this data. If that is going to be a problem for you, "
 "do not fear - you can provide this flag to tell anvi'o to download the latest, freshest data directly "
-"from KEGG's REST API and set it up into an anvi'o-compatible database."}
+"from KEGG's REST API and set it up into anvi'o-compatible files."}
 ),
 'only-download': (
 ['--only-download'],
 {'default': False,
 'action': 'store_true',
 'help': "You want this program to only download data from KEGG, and then stop. It will not "
-"make a modules database. (It would be a *very* good idea for you to specify a "
-"data directory using --kegg-data-dir in this case, so that you can find the resulting "
-"data easily and avoid messing up any data in the default KEGG directory. But you are "
-"of course free to do whatever you want.). Note that KOfam profiles will still be "
-"processed with `hmmpress` if you choose this option."}
+"process the data (ie, into organized HMMs or a modules database). (It would be a "
+"*very* good idea for you to specify a data directory using --kegg-data-dir in this "
+"case, so that you can find the resulting data easily and avoid messing up any data "
+"in the default KEGG directory. But you are of course free to do whatever you want.)"}
 ),
-'only-database': (
-['--only-database'],
+'only-processing': (
+['--only-processing'],
 {'default': False,
 'action': 'store_true',
-'help': "You already have all the KEGG data you need on your computer. Perhaps you even got it from "
+'help': "You already have all the KEGG data you need on your computer. Probably you even got it from "
 "this program, using the --only-download option. We don't know. What matters is that you don't "
-"need anything downloaded, you just want this program to setup a modules database from that "
-"existing data. Good. We can do that if you provide this flag (and probably also the --kegg-data-dir "
+"need anything downloaded, you just want this program to process that "
+"existing data. Good. We can do that if you provide this flag (and hopefully also the --kegg-data-dir "
 "in which said data is located)."}
 ),
 'kegg-snapshot': (
 ['--kegg-snapshot'],
 {'default': None,
 'type': str,
 'metavar': 'RELEASE_NUM',
-'help': "If you are particularly interested in an earlier snapshot of KEGG that anvi'o knows about, you can set it here. "
-"Otherwise anvi'o will always use the latest snapshot it knows about, which is likely to be the one associated with "
-"the current release of anvi'o."}
+'help': "The default behavior of this program is to download a pre-processed snapshot of data "
+"from KEGG. If you are particularly interested in an earlier snapshot of KEGG that anvi'o "
+"knows about, you can set it here. Otherwise anvi'o will always use the latest snapshot "
+"it knows about, which is likely to be the one associated with the current release of anvi'o."}
 ),
 'hide-outlier-SNVs': (
 ['--hide-outlier-SNVs'],
35 changes: 18 additions & 17 deletions anvio/biochemistry/reactionnetwork.py
@@ -1076,7 +1076,8 @@ class KODatabase:
 Unless an alternative directory is provided, the database is downloaded and set up in a
 default anvi'o data directory, and loaded from this directory in network construction.
 """
-default_dir = os.path.join(os.path.dirname(ANVIO_PATH), 'data/MISC/REACTION_NETWORK/KO')
+default_dir = os.path.join(os.path.dirname(ANVIO_PATH), 'data/MISC/KEGG/KO_REACTION_NETWORK')
+expected_files = ['ko_info.txt', 'ko_data.tsv']

 def __init__(self, ko_dir: str = None) -> None:
 """
@@ -1093,19 +1094,17 @@ def __init__(self, ko_dir: str = None) -> None:
 raise ConfigError(f"There is no such directory, '{ko_dir}'.")
 else:
 ko_dir = self.default_dir
-info_path = os.path.join(ko_dir, 'ko_info.txt')
-if not os.path.isfile(info_path):
-raise ConfigError(f"No required file named 'ko_info.txt' was found in the KO directory, '{ko_dir}'.")
-table_path = os.path.join(ko_dir, 'ko_data.tsv')
-if not os.path.isfile(table_path):
-raise ConfigError(f"No required file named 'ko_data.tsv' was found in the KO directory, '{ko_dir}'.")

-f = open(info_path)
+for expected_file in self.expected_files:
+if not os.path.isfile(os.path.join(ko_dir, expected_file)):
+raise ConfigError(f"No required file named '{expected_file}' was found in the KO directory, '{ko_dir}'.")
+
+f = open(os.path.join(ko_dir, 'ko_info.txt'))
 f.readline()
 self.release = ' '.join(f.readline().strip().split()[1:])
 f.close()

-self.ko_table = pd.read_csv(table_path, sep='\t', header=0, index_col=0, low_memory=False)
+self.ko_table = pd.read_csv(os.path.join(ko_dir, 'ko_data.tsv'), sep='\t', header=0, index_col=0, low_memory=False)

 def set_up(
 num_threads: int = 1,
@@ -1124,22 +1123,24 @@ def set_up(
 Number of threads to use in parallelizing the download of KO files.
 dir : str, None
-Directory in which to create a new subdirectory called 'KO', in which files are
-downloaded and set up. This argument overrides the default directory.
+Directory in which to create a subdirectory called `KO_REACTION_NETWORK`,
+in which files are downloaded and set up. This argument overrides
+the default directory.
 reset : bool, False
-If True, remove any existing 'KO' database directory and the files therein. If False,
-an exception is raised if there are files in this directory.
+If True, remove any existing 'KO_REACTION_NETWORK' database directory and the files
+therein. If False, an exception is raised if there are files in this directory.
 run : anvio.terminal.Run, None
 progress : anvio.terminal.Progress, None
 """
 if dir:
 if os.path.isdir(dir):
-ko_dir = os.path.join(dir, 'KO')
+ko_dir = os.path.join(dir, 'KO_REACTION_NETWORK')
 else:
-raise ConfigError(f"There is no such directory, '{dir}'.")
+raise ConfigError(f"There is no such directory, '{dir}'. You should create it "
+"first if you want to use it.")
 else:
 ko_dir = KODatabase.default_dir
 parent_dir = os.path.dirname(ko_dir)
@@ -1242,7 +1243,7 @@ def set_up(
 "from the KO database. Anvi'o will now attempt to redownload all of the files. "
 )
 run.info(f"Total number of KOs/entry files", total)
-run.info("KEGG database version", release_after)
+run.info("KEGG KO database version", release_after)
 run.info("KEGG KO list", list_path)
 run.info("KEGG KO info", info_path)

@@ -1264,7 +1265,7 @@ def set_up(
 section = line.split()[0]
 if section == 'NAME':
 # The name value follows 'NAME' at the beginning of the line.
-ko_data['name'] = line[4:].lstrip().rstrip()
+ko_data['name'] = line[4:].strip()
 # EC numbers associated with the KO are recorded at the end of the name value.
 ec_string = re.search('\[EC:.*\]', line)
 if ec_string:
2 changes: 2 additions & 0 deletions anvio/biochemistry/refdbs.py
@@ -91,6 +91,8 @@ def raise_missing_files(self, missing: List[str]) -> None:
 )

 def _set_up_db_dir(self, reset: bool) -> None:
+if os.path.split(self.db_dir)[0] == self.default_superdir and not os.path.exists(self.default_superdir):
+os.mkdir(self.default_superdir)
 if os.path.exists(self.db_dir):
 if reset:
 rmtree(self.db_dir)
29 changes: 23 additions & 6 deletions anvio/data/misc/KEGG-SNAPSHOTS.yaml
@@ -6,60 +6,77 @@ v2020-04-27:
 archive_name: KEGG_build_2020-04-27_b893b7b915cb.tar.gz
 hash: b893b7b915cb
 modules_db_version: 1
+no_modeling_data: True

 v2020-06-23:
 url: https://ndownloader.figshare.com/files/23701919
 archive_name: KEGG_build_2020-06-23_4a75508b48aa.tar.gz
 hash: 4a75508b48aa
 modules_db_version: 2
+no_modeling_data: True

 v2020-08-06:
 url: https://ndownloader.figshare.com/files/25464530
 archive_name: KEGG_build_2020-08-06_8f88ef165f4c.tar.gz
 hash: 8f88ef165f4c
 modules_db_version: 2
+no_modeling_data: True

 v2020-12-23:
 url: https://ndownloader.figshare.com/files/25878342
 archive_name: KEGG_build_2020-12-23_45b7cc2e4fdc.tar.gz
 hash: 45b7cc2e4fdc
 modules_db_version: 2
+no_modeling_data: True

 v2021-12-18:
 url: https://figshare.com/ndownloader/files/31959416
 archive_name: KEGG_build_2021-12-18_58937b64c44c.tar.gz
 hash: 58937b64c44c
 modules_db_version: 3
+no_modeling_data: True

 v2022-04-14:
 url: https://figshare.com/ndownloader/files/34817812
 archive_name: KEGG_build_2022-04-14_666feeac5de2.tar.gz
 hash: 666feeac5de2
 modules_db_version: 4
+no_modeling_data: True

 v2023-01-10:
 url: https://figshare.com/ndownloader/files/38799687
 archive_name: KEGG_build_2023-01-10_d20a0dcd2128.tar.gz
 hash: d20a0dcd2128
 modules_db_version: 4
+no_modeling_data: True

 v2023-09-18:
 url: https://figshare.com/ndownloader/files/42381873
 archive_name: KEGG_build_2023-09-18_a2b5bde358bb.tar.gz
 hash: a2b5bde358bb
 modules_db_version: 4
+no_modeling_data: True
+
+v2023-09-22:
+url: https://figshare.com/ndownloader/files/42428115
+archive_name: KEGG_build_2023-09-22_a2b5bde358bb.tar.gz
+hash: a2b5bde358bb
+modules_db_version: 4

 # How to add a new KEGG snapshot to this file:
 # 1. download the latest data directly from KEGG by running
-# `anvi-setup-kegg-kofams -D --kegg-data-dir ./KEGG`
+# `anvi-setup-kegg-data -D --kegg-data-dir ./KEGG -T 5`
 # 2. get the hash value and version info from the MODULES.db:
 # `anvi-db-info ./KEGG/MODULES.db`
 # 3. archive that directory:
 # `tar -czvf KEGG_build_YYYY-MM-DD_HASH.tar.gz ./KEGG`
-# Please remember to replace YYYY-MM-DD with the current date and replace HASH with the MODULES.db hash value obtained in step 2
+# Please remember to replace YYYY-MM-DD with the current date and replace HASH with the
+# MODULES.db hash value obtained in step 2
 # 4. Test that setup works with this archive by running
-# `anvi-setup-kegg-kofams --kegg-archive KEGG_build_YYYY-MM-DD_HASH.tar.gz --kegg-data-dir TEST_NEW_KEGG_ARCHIVE`
+# `anvi-setup-kegg-data --kegg-archive KEGG_build_YYYY-MM-DD_HASH.tar.gz --kegg-data-dir TEST_NEW_KEGG_ARCHIVE`
 # 5. Upload the .tar.gz archive to figshare and get the download url
-# 6. Finally, add an entry to the bottom of this file with the url, archive name, and MODULES.db hash and version. You should also update the
-# default self.target_snapshot variable in kegg.py to point to this latest version that you have added.
-# 7. Test it by running `anvi-setup-kegg-kofams --kegg-data-dir TEST_NEW_KEGG`, and if it works you are done :)
+# 6. Finally, add an entry to the bottom of this file with the url, archive name, and MODULES.db hash and version.
+# You should also update the default self.target_snapshot variable in kegg.py to point to this
+# latest version that you have added.
+# 7. Test it by running `anvi-setup-kegg-data --kegg-data-dir TEST_NEW_KEGG` (you don't need to run the full thing,
+# just long enough to see that the correct snapshot is being downloaded), and if it works you are done :)
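
The archive naming and entry layout described in the snapshot steps can be sketched in Python. This is only an illustration under stated assumptions: `snapshot_entry` and `ARCHIVE_RE` are hypothetical names that do not exist in anvi'o, the 12-character lowercase hex hash is inferred from the entries listed in this file, and the four-space indentation of the generated YAML is a guess, since the file's real indentation is not visible in this view.

```python
import re
from datetime import date

# Assumed naming convention, inferred from the archive names in KEGG-SNAPSHOTS.yaml:
# KEGG_build_YYYY-MM-DD_HASH.tar.gz, where HASH is 12 lowercase hex characters.
ARCHIVE_RE = re.compile(r"KEGG_build_\d{4}-\d{2}-\d{2}_[0-9a-f]{12}\.tar\.gz")


def snapshot_entry(build_date: date, db_hash: str, url: str, modules_db_version: int) -> str:
    """Return YAML text for one snapshot entry (hypothetical helper, not part of anvi'o)."""
    stamp = build_date.isoformat()  # YYYY-MM-DD
    archive_name = f"KEGG_build_{stamp}_{db_hash}.tar.gz"
    # Refuse names that do not follow the convention, e.g. a malformed hash.
    if not ARCHIVE_RE.fullmatch(archive_name):
        raise ValueError(f"Unexpected archive name: {archive_name}")
    lines = [
        f"v{stamp}:",
        f"    url: {url}",
        f"    archive_name: {archive_name}",
        f"    hash: {db_hash}",
        f"    modules_db_version: {modules_db_version}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Values copied from the v2023-09-22 entry added in this commit.
    print(snapshot_entry(date(2023, 9, 22), "a2b5bde358bb",
                         "https://figshare.com/ndownloader/files/42428115", 4))
```

Generating the entry from the same date and hash used to name the archive keeps the `archive_name` and `hash` fields from drifting apart when an entry is written by hand.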
2 changes: 1 addition & 1 deletion anvio/data/misc/PEOPLE/DEVELOPERS.yaml
@@ -7,7 +7,7 @@
 linkedin: meren
 orcid: 0000-0001-9013-4827
 skype: a.murat.eren
-bio: "Computer scientist and microbial ecologist interested in undersatnding mechanisms by which microbes interact with their surroundings, evolve, disperse, and respond to environmental change."
+bio: "Computer scientist and microbial ecologist interested in understanding mechanisms by which microbes interact with their surroundings, evolve, disperse, and respond to environmental change."
 affiliations:
 - title: Professor
 inst: Helmholtz Institute for Functional Marine Biodiversity at Oldenburg
2 changes: 1 addition & 1 deletion anvio/docs/artifacts/anvi-reaction-network.md
@@ -1,3 +1,3 @@
 This program **generates a metabolic reaction network in a %(contigs-db)s.** Gene %(functions)s that have been annotated in the %(contigs-db)s are compared to reference databases, yielding predictions of the biochemical reactions that may be catalyzed by the gene products. Possible applications of anvi'o metabolic networks include the export of draft metabolic models (see %(anvi-get-metabolic-model-file)s) and the import and integration of metabolomic datasets.

-A network can currently be generated from KEGG Orthology (KO) annotations of genes in conjunction with %(reaction-ref-data)s: KEGG ([KO](https://www.genome.jp/kegg/ko.html), [REACTION](https://www.genome.jp/kegg/reaction/), and [COMPOUND](https://www.genome.jp/kegg/compound/)) databases and the [ModelSEED Biochemistry](https://github.com/ModelSEED/ModelSEEDDatabase) database. The reference databases must have been downloaded and set up by %(anvi-setup-protein-reference-database)s.
+A network can currently be generated from KEGG Orthology (KO) annotations of genes in conjunction with %(reaction-ref-data)s: KEGG ([KO](https://www.genome.jp/kegg/ko.html), [REACTION](https://www.genome.jp/kegg/reaction/), and [COMPOUND](https://www.genome.jp/kegg/compound/)) databases and the [ModelSEED Biochemistry](https://github.com/ModelSEED/ModelSEEDDatabase) database. The reference databases must have been downloaded and set up by %(anvi-setup-modelseed-database)s.
6 changes: 3 additions & 3 deletions anvio/docs/artifacts/kegg-data.md
@@ -1,16 +1,16 @@
 A **directory of data** downloaded from the [KEGG database resource](https://www.kegg.jp/) for use in function annotation and metabolism estimation.

-It is created by running the program %(anvi-setup-kegg-kofams)s. Not everything from KEGG is included in this directory, only the information relevant to downstream programs. The most critical components of this directory are KOfam HMM profiles and the %(modules-db)s which contains information on metabolic pathways as described in the [KEGG MODULES resource](https://www.genome.jp/kegg/module.html), as well as functional classification hierarchies from [KEGG BRITE](https://www.genome.jp/kegg/brite.html).
+It is created by running the program %(anvi-setup-kegg-data)s. Not everything from KEGG is included in this directory, only the information relevant to downstream programs. The most critical components of this directory are KOfam HMM profiles and the %(modules-db)s which contains information on metabolic pathways as described in the [KEGG MODULES resource](https://www.genome.jp/kegg/module.html), as well as functional classification hierarchies from [KEGG BRITE](https://www.genome.jp/kegg/brite.html).

 Programs that rely on this data directory include %(anvi-run-kegg-kofams)s and %(anvi-estimate-metabolism)s.

 ## Directory Location
 The default location of this data is in the anvi'o folder, at `anvio/anvio/data/misc/KEGG/`.

-You can change this location when you run %(anvi-setup-kegg-kofams)s by providing a different path to the `--kegg-data-dir` parameter:
+You can change this location when you run %(anvi-setup-kegg-data)s by providing a different path to the `--kegg-data-dir` parameter:

 {{ codestart }}
-anvi-setup-kegg-kofams --kegg-data-dir /path/to/directory/KEGG
+anvi-setup-kegg-data --kegg-data-dir /path/to/directory/KEGG
 {{ codestop }}

 If you do this, you will need to provide this path to downstream programs that require this data as well.
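
One way to keep a custom data directory consistent across the setup program and the downstream programs is to build every command line from a single path. The Python sketch below is a minimal illustration, not anvi'o code: `with_kegg_dir` is a hypothetical helper, it assumes the downstream programs named in this document take the same `--kegg-data-dir` flag (as the sentence above implies), and it leaves out any other arguments those programs require.

```python
import shlex

# Single source of truth for the custom KEGG data location (example path from the docs).
KEGG_DATA_DIR = "/path/to/directory/KEGG"


def with_kegg_dir(program: str, *args: str, kegg_data_dir: str = KEGG_DATA_DIR) -> str:
    """Return a shell command line that passes --kegg-data-dir to `program`.

    Hypothetical helper for illustration; other required arguments of each
    program are not filled in here and would be supplied through `args`.
    """
    return shlex.join([program, *args, "--kegg-data-dir", kegg_data_dir])


if __name__ == "__main__":
    for program in ("anvi-setup-kegg-data", "anvi-run-kegg-kofams", "anvi-estimate-metabolism"):
        print(with_kegg_dir(program))
```

Because the path is defined once, moving the data directory means changing one constant rather than editing each command.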