Merge pull request #2121 from merenlab/kegg_download_consolidation

Consolidating our KEGG data download
merenlab · Sep 25, 2023 · 66d6493
2 parents 9b88802 + fb10435

Showing 26 changed files with 1,175 additions and 770 deletions.
diff --git a/.conda/environment.yaml b/.conda/environment.yaml
@@ -31,3 +31,4 @@ dependencies:
 - r-magrittr
 - bioconductor-qvalue
 - fastani
+- meme
diff --git a/.github/workflows/daily-component-tests-and-migrations.yaml b/.github/workflows/daily-component-tests-and-migrations.yaml
@@ -37,10 +37,13 @@ jobs:
         anvi-self-test --suite metagenomics-full --no-interactive
         anvi-self-test --suite pangenomics --no-interactive
         anvi-self-test --suite inversions --no-interactive
-        anvi-self-test --suite metabolism --no-interactive
 # the following steps cause our actions to fail on GitHub runners
 # due to space limitations :/ please do not uncomment this until we
 # have a solution for this :/
+    #- name: "Run component tests for metabolism framework"
+    #  shell: bash -l {0}
+    #  run: |
+    #    anvi-self-test --suite metabolism --no-interactive
     #- name: "Migrate ancient anvi'o databases"
     #  shell: bash -l {0}
     #  run: |

diff --git a/Dockerfiles/anvio-structure/Dockerfile b/Dockerfiles/anvio-structure/Dockerfile
@@ -72,7 +72,7 @@ RUN rm anvio-7.1.tar.gz
 # Setup anvi'o databases
 ##############################################################
 RUN anvi-setup-interacdome
-RUN anvi-setup-kegg-kofams --kegg-snapshot v2020-12-23
+RUN anvi-setup-kegg-data --kegg-snapshot v2020-12-23
 RUN anvi-setup-pfams --pfam-version 33.1
 RUN anvi-setup-ncbi-cogs --cog-version COG20
 

diff --git a/anvio/__init__.py b/anvio/__init__.py
@@ -1044,37 +1044,37 @@ def TABULATE(table, header, numalign="right", max_width=0):
                      "you will not have the most up-to-date version of KEGG for your annotations, metabolism "
                      "estimations, or any other downstream uses of this data. If that is going to be a problem for you, "
                      "do not fear - you can provide this flag to tell anvi'o to download the latest, freshest data directly "
-                     "from KEGG's REST API and set it up into an anvi'o-compatible database."}
+                     "from KEGG's REST API and set it up into anvi'o-compatible files."}
                 ),
     'only-download': (
             ['--only-download'],
             {'default': False,
              'action': 'store_true',
              'help': "You want this program to only download data from KEGG, and then stop. It will not "
-                     "make a modules database. (It would be a *very* good idea for you to specify a "
-                     "data directory using --kegg-data-dir in this case, so that you can find the resulting "
-                     "data easily and avoid messing up any data in the default KEGG directory. But you are "
-                     "of course free to do whatever you want.). Note that KOfam profiles will still be "
-                     "processed with `hmmpress` if you choose this option."}
+                     "process the data (ie, into organized HMMs or a modules database). (It would be a "
+                     "*very* good idea for you to specify a data directory using --kegg-data-dir in this "
+                     "case, so that you can find the resulting data easily and avoid messing up any data "
+                     "in the default KEGG directory. But you are of course free to do whatever you want.)"}
              ),
-    'only-database': (
-            ['--only-database'],
+    'only-processing': (
+            ['--only-processing'],
             {'default': False,
              'action': 'store_true',
-             'help': "You already have all the KEGG data you need on your computer. Perhaps you even got it from "
+             'help': "You already have all the KEGG data you need on your computer. Probably you even got it from "
                      "this program, using the --only-download option. We don't know. What matters is that you don't "
-                     "need anything downloaded, you just want this program to setup a modules database from that "
-                     "existing data. Good. We can do that if you provide this flag (and probably also the --kegg-data-dir "
+                     "need anything downloaded, you just want this program to process that "
+                     "existing data. Good. We can do that if you provide this flag (and hopefully also the --kegg-data-dir "
                      "in which said data is located)."}
              ),
     'kegg-snapshot': (
             ['--kegg-snapshot'],
             {'default': None,
              'type': str,
              'metavar': 'RELEASE_NUM',
-             'help': "If you are particularly interested in an earlier snapshot of KEGG that anvi'o knows about, you can set it here. "
-                     "Otherwise anvi'o will always use the latest snapshot it knows about, which is likely to be the one associated with "
-                     "the current release of anvi'o."}
+             'help': "The default behavior of this program is to download a pre-processed snapshot of data "
+                     "from KEGG. If you are particularly interested in an earlier snapshot of KEGG that anvi'o "
+                     "knows about, you can set it here. Otherwise anvi'o will always use the latest snapshot "
+                     "it knows about, which is likely to be the one associated with the current release of anvi'o."}
                 ),
     'hide-outlier-SNVs': (
             ['--hide-outlier-SNVs'],

diff --git a/anvio/biochemistry/reactionnetwork.py b/anvio/biochemistry/reactionnetwork.py
@@ -1076,7 +1076,8 @@ class KODatabase:
     Unless an alternative directory is provided, the database is downloaded and set up in a
     default anvi'o data directory, and loaded from this directory in network construction.
     """
-    default_dir = os.path.join(os.path.dirname(ANVIO_PATH), 'data/MISC/REACTION_NETWORK/KO')
+    default_dir = os.path.join(os.path.dirname(ANVIO_PATH), 'data/MISC/KEGG/KO_REACTION_NETWORK')
+    expected_files = ['ko_info.txt', 'ko_data.tsv']
 
     def __init__(self, ko_dir: str = None) -> None:
         """
@@ -1093,19 +1094,17 @@ def __init__(self, ko_dir: str = None) -> None:
                 raise ConfigError(f"There is no such directory, '{ko_dir}'.")
         else:
             ko_dir = self.default_dir
-        info_path = os.path.join(ko_dir, 'ko_info.txt')
-        if not os.path.isfile(info_path):
-            raise ConfigError(f"No required file named 'ko_info.txt' was found in the KO directory, '{ko_dir}'.")
-        table_path = os.path.join(ko_dir, 'ko_data.tsv')
-        if not os.path.isfile(table_path):
-            raise ConfigError(f"No required file named 'ko_data.tsv' was found in the KO directory, '{ko_dir}'.")
 
-        f = open(info_path)
+        for expected_file in self.expected_files:
+            if not os.path.isfile(os.path.join(ko_dir, expected_file)):
+                raise ConfigError(f"No required file named '{expected_file}' was found in the KO directory, '{ko_dir}'.")
+
+        f = open(os.path.join(ko_dir, 'ko_info.txt'))
         f.readline()
         self.release = ' '.join(f.readline().strip().split()[1:])
         f.close()
 
-        self.ko_table = pd.read_csv(table_path, sep='\t', header=0, index_col=0, low_memory=False)
+        self.ko_table = pd.read_csv(os.path.join(ko_dir, 'ko_data.tsv'), sep='\t', header=0, index_col=0, low_memory=False)
 
     def set_up(
         num_threads: int = 1,
@@ -1124,22 +1123,24 @@ def set_up(
             Number of threads to use in parallelizing the download of KO files.
 
         dir : str, None
-            Directory in which to create a new subdirectory called 'KO', in which files are
-            downloaded and set up. This argument overrides the default directory.
+            Directory in which to create a subdirectory called `KO_REACTION_NETWORK`,
+            in which files are downloaded and set up. This argument overrides
+            the default directory.
 
         reset : bool, False
-            If True, remove any existing 'KO' database directory and the files therein. If False,
-            an exception is raised if there are files in this directory.
+            If True, remove any existing 'KO_REACTION_NETWORK' database directory and the files
+            therein. If False, an exception is raised if there are files in this directory.
 
         run : anvio.terminal.Run, None
 
         progress : anvio.terminal.Progress, None
         """
         if dir:
             if os.path.isdir(dir):
-                ko_dir = os.path.join(dir, 'KO')
+                ko_dir = os.path.join(dir, 'KO_REACTION_NETWORK')
             else:
-                raise ConfigError(f"There is no such directory, '{dir}'.")
+                raise ConfigError(f"There is no such directory, '{dir}'. You should create it "
+                                   "first if you want to use it.")
         else:
             ko_dir = KODatabase.default_dir
             parent_dir = os.path.dirname(ko_dir)
@@ -1242,7 +1243,7 @@ def set_up(
                     "from the KO database. Anvi'o will now attempt to redownload all of the files. "
                 )
         run.info(f"Total number of KOs/entry files", total)
-        run.info("KEGG database version", release_after)
+        run.info("KEGG KO database version", release_after)
         run.info("KEGG KO list", list_path)
         run.info("KEGG KO info", info_path)
 
@@ -1264,7 +1265,7 @@ def set_up(
                     section = line.split()[0]
                 if section == 'NAME':
                     # The name value follows 'NAME' at the beginning of the line.
-                    ko_data['name'] = line[4:].lstrip().rstrip()
+                    ko_data['name'] = line[4:].strip()
                     # EC numbers associated with the KO are recorded at the end of the name value.
                     ec_string = re.search('\[EC:.*\]', line)
                     if ec_string:

diff --git a/anvio/biochemistry/refdbs.py b/anvio/biochemistry/refdbs.py
@@ -91,6 +91,8 @@ def raise_missing_files(self, missing: List[str]) -> None:
             )
 
     def _set_up_db_dir(self, reset: bool) -> None:
+        if os.path.split(self.db_dir)[0] == self.default_superdir and not os.path.exists(self.default_superdir):
+            os.mkdir(self.default_superdir)
         if os.path.exists(self.db_dir):
             if reset:
                 rmtree(self.db_dir)

diff --git a/anvio/data/misc/KEGG-SNAPSHOTS.yaml b/anvio/data/misc/KEGG-SNAPSHOTS.yaml
@@ -6,60 +6,77 @@ v2020-04-27:
     archive_name: KEGG_build_2020-04-27_b893b7b915cb.tar.gz
     hash: b893b7b915cb
     modules_db_version: 1
+    no_modeling_data: True
 
 v2020-06-23:
     url: https://ndownloader.figshare.com/files/23701919
     archive_name: KEGG_build_2020-06-23_4a75508b48aa.tar.gz
     hash: 4a75508b48aa
     modules_db_version: 2
+    no_modeling_data: True
 
 v2020-08-06:
     url: https://ndownloader.figshare.com/files/25464530
     archive_name: KEGG_build_2020-08-06_8f88ef165f4c.tar.gz
     hash: 8f88ef165f4c
     modules_db_version: 2
+    no_modeling_data: True
 
 v2020-12-23:
     url: https://ndownloader.figshare.com/files/25878342
     archive_name: KEGG_build_2020-12-23_45b7cc2e4fdc.tar.gz
     hash: 45b7cc2e4fdc
     modules_db_version: 2
+    no_modeling_data: True
 
 v2021-12-18:
     url: https://figshare.com/ndownloader/files/31959416
     archive_name: KEGG_build_2021-12-18_58937b64c44c.tar.gz
     hash: 58937b64c44c
     modules_db_version: 3
+    no_modeling_data: True
 
 v2022-04-14:
     url: https://figshare.com/ndownloader/files/34817812
     archive_name: KEGG_build_2022-04-14_666feeac5de2.tar.gz
     hash: 666feeac5de2
     modules_db_version: 4
+    no_modeling_data: True
 
 v2023-01-10:
     url: https://figshare.com/ndownloader/files/38799687
     archive_name: KEGG_build_2023-01-10_d20a0dcd2128.tar.gz
     hash: d20a0dcd2128
     modules_db_version: 4
+    no_modeling_data: True
 
 v2023-09-18:
     url: https://figshare.com/ndownloader/files/42381873
     archive_name: KEGG_build_2023-09-18_a2b5bde358bb.tar.gz
     hash: a2b5bde358bb
     modules_db_version: 4
+    no_modeling_data: True
+
+v2023-09-22:
+    url: https://figshare.com/ndownloader/files/42428115
+    archive_name: KEGG_build_2023-09-22_a2b5bde358bb.tar.gz
+    hash: a2b5bde358bb
+    modules_db_version: 4
 
 # How to add a new KEGG snapshot to this file:
 # 1. download the latest data directly from KEGG by running
-#    `anvi-setup-kegg-kofams -D --kegg-data-dir ./KEGG`
+#    `anvi-setup-kegg-data -D --kegg-data-dir ./KEGG -T 5`
 # 2. get the hash value and version info from the MODULES.db:
 #    `anvi-db-info ./KEGG/MODULES.db`
 # 3. archive that directory:
 #   `tar -czvf KEGG_build_YYYY-MM-DD_HASH.tar.gz ./KEGG`
-#   Please remember to replace YYYY-MM-DD with the current date and replace HASH with the MODULES.db hash value obtained in step 2
+#   Please remember to replace YYYY-MM-DD with the current date and replace HASH with the 
+#   MODULES.db hash value obtained in step 2
 # 4. Test that setup works with this archive by running
-#    `anvi-setup-kegg-kofams --kegg-archive KEGG_build_YYYY-MM-DD_HASH.tar.gz --kegg-data-dir TEST_NEW_KEGG_ARCHIVE`
+#    `anvi-setup-kegg-data --kegg-archive KEGG_build_YYYY-MM-DD_HASH.tar.gz --kegg-data-dir TEST_NEW_KEGG_ARCHIVE`
 # 5. Upload the .tar.gz archive to figshare and get the download url
-# 6. Finally, add an entry to the bottom of this file with the url, archive name, and MODULES.db hash and version. You should also update the
-# default self.target_snapshot variable in kegg.py to point to this latest version that you have added.
-# 7. Test it by running `anvi-setup-kegg-kofams --kegg-data-dir TEST_NEW_KEGG`, and if it works you are done :)
+# 6. Finally, add an entry to the bottom of this file with the url, archive name, and MODULES.db hash and version. 
+#    You should also update the default self.target_snapshot variable in kegg.py to point to this 
+#    latest version that you have added.
+# 7. Test it by running `anvi-setup-kegg-data --kegg-data-dir TEST_NEW_KEGG` (you don't need to run the full thing, 
+#    just long enough to see that the correct snapshot is being downloaded), and if it works you are done :)
diff --git a/anvio/data/misc/PEOPLE/DEVELOPERS.yaml b/anvio/data/misc/PEOPLE/DEVELOPERS.yaml
@@ -7,7 +7,7 @@
   linkedin: meren
   orcid: 0000-0001-9013-4827
   skype: a.murat.eren
-  bio: "Computer scientist and microbial ecologist interested in undersatnding mechanisms by which microbes interact with their surroundings, evolve, disperse, and respond to environmental change."
+  bio: "Computer scientist and microbial ecologist interested in understanding mechanisms by which microbes interact with their surroundings, evolve, disperse, and respond to environmental change."
   affiliations:
     - title: Professor
       inst: Helmholtz Institute for Functional Marine Biodiversity at Oldenburg

diff --git a/anvio/docs/artifacts/anvi-reaction-network.md b/anvio/docs/artifacts/anvi-reaction-network.md
@@ -1,3 +1,3 @@
 This program **generates a metabolic reaction network in a %(contigs-db)s.** Gene %(functions)s that have been annotated in the %(contigs-db)s are compared to reference databases, yielding predictions of the biochemical reactions that may be catalyzed by the gene products. Possible applications of anvi'o metabolic networks include the export of draft metabolic models (see %(anvi-get-metabolic-model-file)s) and the import and integration of metabolomic datasets.
 
-A network can currently be generated from KEGG Orthology (KO) annotations of genes in conjunction with %(reaction-ref-data)s: KEGG ([KO](https://www.genome.jp/kegg/ko.html), [REACTION](https://www.genome.jp/kegg/reaction/), and [COMPOUND](https://www.genome.jp/kegg/compound/)) databases and the [ModelSEED Biochemistry](https://github.com/ModelSEED/ModelSEEDDatabase) database. The reference databases must have been downloaded and set up by %(anvi-setup-protein-reference-database)s.
+A network can currently be generated from KEGG Orthology (KO) annotations of genes in conjunction with %(reaction-ref-data)s: KEGG ([KO](https://www.genome.jp/kegg/ko.html), [REACTION](https://www.genome.jp/kegg/reaction/), and [COMPOUND](https://www.genome.jp/kegg/compound/)) databases and the [ModelSEED Biochemistry](https://github.com/ModelSEED/ModelSEEDDatabase) database. The reference databases must have been downloaded and set up by %(anvi-setup-modelseed-database)s.
diff --git a/anvio/docs/artifacts/kegg-data.md b/anvio/docs/artifacts/kegg-data.md
@@ -1,16 +1,16 @@
 A **directory of data** downloaded from the [KEGG database resource](https://www.kegg.jp/) for use in function annotation and metabolism estimation.
 
-It is created by running the program %(anvi-setup-kegg-kofams)s. Not everything from KEGG is included in this directory, only the information relevant to downstream programs. The most critical components of this directory are KOfam HMM profiles and the %(modules-db)s which contains information on metabolic pathways as described in the [KEGG MODULES resource](https://www.genome.jp/kegg/module.html), as well as functional classification hierarchies from [KEGG BRITE](https://www.genome.jp/kegg/brite.html).
+It is created by running the program %(anvi-setup-kegg-data)s. Not everything from KEGG is included in this directory, only the information relevant to downstream programs. The most critical components of this directory are KOfam HMM profiles and the %(modules-db)s which contains information on metabolic pathways as described in the [KEGG MODULES resource](https://www.genome.jp/kegg/module.html), as well as functional classification hierarchies from [KEGG BRITE](https://www.genome.jp/kegg/brite.html).
 
 Programs that rely on this data directory include %(anvi-run-kegg-kofams)s and %(anvi-estimate-metabolism)s.
 
 ## Directory Location
 The default location of this data is in the anvi'o folder, at `anvio/anvio/data/misc/KEGG/`.
 
-You can change this location when you run %(anvi-setup-kegg-kofams)s by providing a different path to the `--kegg-data-dir` parameter:
+You can change this location when you run %(anvi-setup-kegg-data)s by providing a different path to the `--kegg-data-dir` parameter:
 
 {{ codestart }}
-anvi-setup-kegg-kofams --kegg-data-dir /path/to/directory/KEGG
+anvi-setup-kegg-data --kegg-data-dir /path/to/directory/KEGG
 {{ codestop }}
 
 If you do this, you will need to provide this path to downstream programs that require this data as well.