Consolidating our KEGG data download #2121

ivagljiva · 2023-09-21T15:19:59Z

This PR consolidates the KEGG data downloading programs anvi-setup-protein-reference-database and anvi-setup-kegg-kofams into one program called anvi-setup-kegg-data. This program has various modes for independent download of different subsets of KEGG data (currently: KOfam, modules (KEGG MODULE + BRITE hierarchies => modules database), and modeling (KEGG Orthology + KEGG REACTION => table for metabolic modeling)). There is also an all mode that downloads all of the above :)
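To make the mode idea concrete, here is a minimal sketch of how a single --mode option could select which setup steps run. It is not the code in this PR: the setup_* functions, the MODES dict, and the exact mode strings are placeholders chosen for illustration.

```python
import argparse

# Hypothetical stand-ins for the real setup steps; each only reports what it would set up.
def setup_kofam():
    return "KOfam profiles"

def setup_modules():
    return "modules database"

def setup_modeling():
    return "KO data for modeling"

# Mode name -> setup steps to run, in order. The mode strings are assumptions.
MODES = {
    "KOfam": [setup_kofam],
    "modules": [setup_modules],
    "modeling": [setup_modeling],
}
# 'all' simply chains the independent modes one after another.
MODES["all"] = [step for steps in list(MODES.values()) for step in steps]

def run(argv):
    parser = argparse.ArgumentParser(description="toy mode dispatcher")
    parser.add_argument("--mode", choices=sorted(MODES), default="all")
    args = parser.parse_args(argv)
    return [step() for step in MODES[args.mode]]

if __name__ == "__main__":
    print(run(["--mode", "modules"]))
```

Keeping the modes in one dict means the 'all' entry is derived from the others rather than maintained by hand.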

The default behavior of this program is to download a snapshot of the KEGG data directory that was already set up and processed and put on Figshare. This is similar to the previous behavior of anvi-setup-kegg-kofams, except that now the snapshots will include the modeling data as well.

Here is a summary of the most important changes and caveats:

- the default directory for the modeling data is now within the default KEGG data dir, at anvio/anvio/data/MISC/KEGG/KO_REACTION_NETWORK. @semiller10 you could change this directory name if you wanted :)
- the KOfam-specific and module-specific functions of the KeggSetup class have been split into subclasses for setting up their respective database types: KOfamDownload and ModulesDownload
- the downloads have been multi-threaded whenever possible, using @semiller10's strategy originally used in anvi-setup-protein-reference-database
- I've added new sanity checks downstream for anvi-run-kegg-kofams and anvi-estimate-metabolism, since KOfams can now be downloaded without associated modules data
- since earlier KEGG snapshots will not contain modeling data, I added no_modeling_data: True lines to each of the current snapshots in KEGG-SNAPSHOTS.yaml.
  - When sanity checking if all expected files are present in the downloaded snapshot, we don't fail if that line is present. Instead, we warn the user that modeling data is not present and explain how to download it. (We shouldn't have to include those lines in any snapshot entries from this point forward.)
  - NOTE: the modeling data setup code still belongs to reactionnetwork.KODatabase to avoid fragmenting @semiller10's modeling code too much. This class is simply called from anvi-setup-kegg-data.
  - one quirk of this is that the expected modeling files are hardcoded in the kegg.KeggSetup.kegg_archive_is_ok() function rather than being smartly loaded from the reactionnetwork.KODatabase class. We'll have to keep these two in sync unless we find a nicer solution
- in order to handle arguments for each mode, I used a little hack that Ozcan originally wrote for anvi-cluster-contigs
  - unfortunately, these arguments are printed AFTER the epilog program details
  - another caveat of this is that the mode-specific arguments do not exist in --mode all. So, for example, you cannot use the parameter --skip-brite-hierarchies with --mode all. I think it's probably fine.
- the KEGG downloads have been removed from anvi-setup-protein-reference-database and that script has been renamed to anvi-setup-modelseed-database
- the metabolism self-test suite runs anvi-setup-kegg-data --mode all (with the latest snapshot)
- I updated the documentation as best as I could to include the new modes, parameters, and file names
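The no_modeling_data behavior described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the actual kegg_archive_is_ok() code: the function name modeling_data_status, the MODELING_FILES list, and the message wording are hypothetical, and the snapshot entry is modeled as a plain dict.

```python
import os
import tempfile

# Assumed file names for the modeling data directory (illustrative only).
MODELING_FILES = ["ko_data.tsv", "ko_entries.tar.gz", "ko_info.txt", "ko_list.txt"]

def modeling_data_status(kegg_dir, snapshot_entry):
    """Return (ok, message) for the modeling part of an unpacked snapshot."""
    modeling_dir = os.path.join(kegg_dir, "KO_REACTION_NETWORK")
    missing = [name for name in MODELING_FILES
               if not os.path.isfile(os.path.join(modeling_dir, name))]
    if not missing:
        return True, "modeling data present"
    if snapshot_entry.get("no_modeling_data"):
        # Older snapshot flagged in the YAML: warn instead of failing.
        return True, "WARNING: this snapshot has no modeling data"
    return False, "missing modeling files: " + ", ".join(missing)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as kegg_dir:
        print(modeling_data_status(kegg_dir, {"no_modeling_data": True}))
        print(modeling_data_status(kegg_dir, {}))
```

The flag only downgrades a failure to a warning; a snapshot without the flag is still expected to contain every modeling file.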

I haven't exhaustively tested all possible parameter combinations, but it seems to be working. All independent modes of download have worked on my computer, as well as downloading the latest snapshot (without modeling data). I am in the process of testing the --mode all --download-from-kegg combination and it is taking forever since the modeling data is HUGE, but I believe it will work in the end because that mode simply runs the independent downloads one after another (and those are fine).

Once this final download test has finished, I will use it to create a new KEGG snapshot that includes the modeling data, and add it to the KEGG-SNAPSHOTS.yaml file. I am hoping that this will be the new default snapshot associated with anvi'o v8 :)

ivagljiva · 2023-09-21T15:23:01Z

@semiller10 , when you get a chance, could you please look over the changes here and make sure it all looks okay and plays nicely with the current version of your modeling code?

If it is all good, it would be amazing if we could merge this into your branch before you merge your branch into master so that it can be included in v8 :) But no pressure. I wasn't expecting this to be finished on time for the next release, so no worries if it needs to wait until our minor release.

Please let me know if there is anything I can clarify or fix. Thanks!

semiller10 · 2023-09-21T21:13:40Z

Thank you very much, @ivagljiva, the code, testing, and explanation are great. It looks like I might want to change a couple things, which I'll explain in response to your message. I certainly want to include this in v8.

> This program has various modes for independent download of different subsets of KEGG data (currently: KOfam, modules (KEGG MODULE + BRITE hierarchies => modules database), and modeling (KEGG Orthology + KEGG REACTION => table for metabolic modeling)).

The reaction network no longer uses the KEGG REACTION database, which turned out to be redundant with the KO and ModelSEED databases. I'll have to update the snapshot. I think there should be options to download KEGG REACTION and KEGG COMPOUND -- I had code to download the latter database in the past before I decided that I didn't need it.

> one quirk of this is that the expected modeling files are hardcoded in the kegg.KeggSetup.kegg_archive_is_ok() function rather than being smartly loaded from the reactionnetwork.KODatabase class. We'll have to keep these two in sync unless we find a nicer solution

I'll update the KODatabase class with a class variable that lists the required files that need to be downloaded to construct the reaction network and reference this variable in kegg_archive_is_ok rather than hard-coding the values.
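A minimal sketch of that idea, assuming a class attribute named expected_files (the attribute name, the helper missing_modeling_files, and the file names are illustrative placeholders, not the final code):

```python
class KODatabase:
    """Stand-in for reactionnetwork.KODatabase; only the file list is modeled."""
    # Hypothetical attribute name and placeholder file names.
    expected_files = ("ko_data.tsv", "ko_entries.tar.gz", "ko_info.txt", "ko_list.txt")

def missing_modeling_files(archive_members, subdir="KO_REACTION_NETWORK"):
    """Return the expected modeling files absent from a list of archive member paths."""
    present = set(archive_members)
    return [name for name in KODatabase.expected_files
            if f"{subdir}/{name}" not in present]

if __name__ == "__main__":
    members = ["KO_REACTION_NETWORK/ko_data.tsv", "KO_REACTION_NETWORK/ko_list.txt"]
    print(missing_modeling_files(members))
```

With the list living on the class, the archive check reads it instead of repeating the names, so the two places cannot drift apart.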

ivagljiva · 2023-09-22T07:36:57Z

Thanks @semiller10 ! First, my final test was a success so I want to proceed with creating the snapshot for v8.

Second, a few questions related to your comments:

> The reaction network no longer uses the KEGG REACTION database, which turned out to be redundant with the KO and ModelSEED databases. I'll have to update the snapshot.

Do you mean that you no longer use the database created by the KODatabase.setup() function in reactionnetwork.py? Or that this function no longer sets up KEGG REACTION (which I think may be the case based on what I see in the downloaded files)?

Here is what gets put into the folder for modeling data at KO_REACTION_NETWORK/:

(anvio-dev) mithrandir:test-kegg iva$ ls TEST_ALL_KEGG_DOWNLOAD/KO_REACTION_NETWORK/
ko_data.tsv       ko_entries.tar.gz ko_info.txt       ko_list.txt

Just four files (and no reference to KEGG REACTION), so I think we are good there? My mistake was that I updated the documentation (and wrote the PR lol) based on my earlier understanding of what was being downloaded, without actually checking to see what really was being downloaded. My bad.

But if this is the case, then we just need to update the documentation to accurately list what we are downloading, and there should be no need to update the snapshot (which doesn't exist yet, but I will start making it on the chance that this indeed is the correct data) :)

> I think there should be options to download KEGG REACTION and KEGG COMPOUND -- I had code to download the latter database in the past before I decided that I didn't need it.

I have some generic functions for downloading KEGG stuff - namely, KeggSetup.download_generic_htext(), KeggSetup.get_accessions_from_htext_file(), KeggSetup.download_generic_flat_file(), and KeggSetup.download_kegg_files_from_hierarchy(). However, these don't perform any further processing as far as I remember. If you have code for REACTION and COMPOUND that you'd like to add, I think they could go nicely with these other functions :)

I also initially was planning to implement a custom mode so that users could download any KEGG database just by providing its name. It was going to use the functions I listed above. But I haven't gotten around to that yet. I'll see if I have time this afternoon to add it, otherwise we may want to wait on this feature for the minor release?

> I'll update the KODatabase class with a class variable that lists the required files that need to be downloaded to construct the reaction network and reference this variable in kegg_archive_is_ok rather than hard-coding the values.

This is an excellent solution, thank you!

ivagljiva · 2023-09-22T07:41:05Z

OH DEAR. One big problem that I just noticed: it seems that the KOfam profiles (downloaded first) somehow get overwritten by the later downloads. There is only modules data and the reaction network data in the KEGG data dir.

I will fix it 😅

UPDATE: It is because of the --reset flag. Both KOfam and modules download can use this parameter, so if you provide it, first the KOfam download resets the directory, and then once that is finished the modules download resets the directory again, eliminating the KOfam download. It can easily be fixed by adding some logic to avoid resetting stuff a second time :)
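For illustration, the "reset only once" logic could look something like the sketch below. It is a toy model, not the actual fix: reset_dir, run_all, and the placeholder output files are hypothetical, and the two download steps are reduced to writing one file each.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete and recreate a data directory."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path)

def run_all(data_dir, reset=False):
    """Run the (toy) KOfam and modules steps, resetting the directory at most once."""
    already_reset = False
    for filename in ("kofam.txt", "modules.txt"):  # placeholder outputs
        if reset and not already_reset:
            reset_dir(data_dir)
            already_reset = True
        os.makedirs(data_dir, exist_ok=True)
        with open(os.path.join(data_dir, filename), "w") as handle:
            handle.write("placeholder\n")
    return sorted(os.listdir(data_dir))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_all(os.path.join(tmp, "KEGG"), reset=True))
```

Without the already_reset guard, the second step would wipe the first step's output, which is the symptom described above.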

ivagljiva · 2023-09-22T09:06:42Z

Okay, now things appear to be working even with the --reset parameter. I also removed KEGG REACTION and COMPOUND from the documentation pages, so now it shows that we are only downloading/processing the KEGG Orthology database.

ivagljiva · 2023-09-22T09:16:59Z

EDIT: (IGNORE THIS)
There is one redundant file, ko_list.txt, in the KEGG data directory. This file is downloaded by both KOfam mode (at KEGG/ko_list.txt) and modeling mode (at KEGG/KO_REACTION_NETWORK/ko_list.txt). Since we need it for both cases, I suggest that we keep it only at the top level (KEGG/ko_list.txt) and put some sanity check to see if it is already there before we download it again.

I currently only see this file referenced in reactionnetwork.py in the line:

list_path = os.path.join(ko_dir, 'ko_list.txt')

and in kegg.py in the line:

self.ko_list_file_path = os.path.join(self.kegg_data_dir, "ko_list.txt")

(and it is also expected to be at the top-level by anvi-script-gen-user-module-file, but this is less important)

I will change the list_path line in reactionnetwork.py to point to the file in the upper-level directory. And I will add sanity checks to both KOfam mode and modeling mode: If the ko_list.txt file is present, we don't download it again, and if not, we do.

ivagljiva · 2023-09-22T09:24:40Z

NEVER MIND that last comment. The two files have the same name, but are in fact different files.

KEGG/ko_list.txt:

knum threshold score_type profile_type F-measure nseq nseq_used alen mlen eff_nseq re/pos definition
K00001 369.27 domain all 0.270219 2613 2153 2015 509 13.47 0.590 alcohol dehydrogenase [EC:1.1.1.1]
K00002 455.97 full all 0.466579 2563 2450 6243 458 6.81 0.590 alcohol dehydrogenase (NADP+) [EC:1.1.1.2]

KEGG/KO_REACTION_NETWORK/ko_list.txt:

K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
K00002 AKR1A1, adh; alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
K00003 hom; homoserine dehydrogenase [EC:1.1.1.3]
K00004 BDH, butB; (R,R)-butanediol dehydrogenase / meso-butanediol dehydrogenase / diacetyl reductase [EC:1.1.1.4 1.1.1.- 1.1.1.303]

we will keep both! sorry for the chaos and any heart attacks that were caused by the suggestion to delete one of these files.
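Since the two files share a name, a quick way to tell them apart is to look at the first line. The sketch below is only an illustration based on the excerpts above: the function name ko_list_kind and the labels "kofam" and "orthology" are made up here, and it assumes the KOfam file starts with the knum/threshold header row while the other file starts directly with a KO accession.

```python
def ko_list_kind(first_line):
    """Classify a ko_list.txt by its first line."""
    fields = first_line.split()
    if fields[:2] == ["knum", "threshold"]:
        return "kofam"       # header row of the KOfam threshold table
    if fields and fields[0].startswith("K") and fields[0][1:].isdigit():
        return "orthology"   # data row: KO accession followed by its name
    return "unknown"

if __name__ == "__main__":
    print(ko_list_kind("knum threshold score_type profile_type"))
    print(ko_list_kind("K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]"))
```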

ivagljiva · 2023-09-22T11:42:43Z

Snapshot and a few stats

These are the contents of the KEGG data directory after running the setup script with the -D parameter:

iva$ ls -R KEGG
BRITE               KO_REACTION_NETWORK hierarchies.json    modules             orphan_data
HMMs                MODULES.db          ko_list.txt         modules.keg

KEGG/BRITE:
ko00001 ko00535 ko01000 ko01003 ko01006 ko01009 ko02000 ko02042 ko03000 ko03012 ko03021 ko03036 ko03051 ko03200 ko03400 ko04040 ko04054 ko04121 ko04515
ko00194 ko00536 ko01001 ko01004 ko01007 ko01011 ko02022 ko02044 ko03009 ko03016 ko03029 ko03037 ko03100 ko03210 ko04030 ko04050 ko04090 ko04131 ko04812
ko00199 ko00537 ko01002 ko01005 ko01008 ko01504 ko02035 ko02048 ko03011 ko03019 ko03032 ko03041 ko03110 ko03310 ko04031 ko04052 ko04091 ko04147 ko04990

KEGG/HMMs:
Kofam.hmm     Kofam.hmm.h3f Kofam.hmm.h3i Kofam.hmm.h3m Kofam.hmm.h3p

KEGG/KO_REACTION_NETWORK:
ko_data.tsv       ko_entries.tar.gz ko_info.txt       ko_list.txt

KEGG/modules:
M00001 M00021 M00042 M00064 M00086 M00107 M00128 M00149 M00170 M00366 M00433 M00545 M00575 M00621 M00660 M00740 M00785 M00810 M00838 M00860 M00890 M00912 M00933 M00953
M00002 M00022 M00043 M00065 M00087 M00108 M00129 M00150 M00171 M00367 M00525 M00546 M00576 M00622 M00661 M00741 M00786 M00811 M00840 M00861 M00891 M00913 M00934 M00956
M00003 M00023 M00044 M00066 M00088 M00109 M00130 M00151 M00172 M00368 M00526 M00547 M00577 M00623 M00664 M00744 M00787 M00814 M00841 M00862 M00892 M00914 M00935 M00957
M00004 M00024 M00045 M00067 M00089 M00110 M00131 M00152 M00173 M00369 M00527 M00548 M00579 M00624 M00672 M00745 M00788 M00815 M00842 M00866 M00893 M00915 M00936 M00958
M00005 M00025 M00046 M00068 M00090 M00112 M00132 M00153 M00174 M00370 M00528 M00549 M00580 M00625 M00673 M00746 M00789 M00819 M00843 M00867 M00894 M00916 M00937 M00959
M00006 M00026 M00047 M00069 M00091 M00113 M00133 M00154 M00175 M00371 M00529 M00550 M00595 M00627 M00674 M00761 M00790 M00823 M00844 M00868 M00895 M00917 M00938 M00960
M00007 M00027 M00048 M00070 M00092 M00114 M00134 M00155 M00176 M00372 M00530 M00551 M00596 M00630 M00675 M00763 M00793 M00824 M00845 M00872 M00896 M00918 M00939 M00961
M00008 M00028 M00049 M00071 M00093 M00115 M00135 M00156 M00307 M00373 M00531 M00552 M00597 M00631 M00696 M00769 M00794 M00825 M00846 M00873 M00897 M00919 M00940 M00962
M00009 M00029 M00050 M00072 M00094 M00116 M00136 M00157 M00308 M00374 M00532 M00554 M00598 M00632 M00697 M00773 M00795 M00826 M00847 M00874 M00898 M00921 M00941 M00963
M00010 M00030 M00051 M00073 M00095 M00117 M00137 M00158 M00309 M00375 M00533 M00555 M00608 M00633 M00698 M00774 M00796 M00827 M00848 M00875 M00899 M00922 M00942 M00964
M00011 M00031 M00052 M00074 M00096 M00118 M00138 M00159 M00338 M00376 M00534 M00563 M00609 M00636 M00700 M00775 M00797 M00828 M00849 M00876 M00900 M00923 M00943 M00965
M00012 M00032 M00053 M00075 M00097 M00119 M00140 M00160 M00344 M00377 M00535 M00564 M00611 M00637 M00702 M00776 M00798 M00829 M00850 M00877 M00901 M00924 M00944 M00966
M00013 M00033 M00055 M00076 M00098 M00120 M00141 M00161 M00345 M00378 M00537 M00565 M00612 M00638 M00704 M00777 M00799 M00830 M00851 M00878 M00902 M00925 M00945 M00967
M00014 M00034 M00056 M00077 M00099 M00121 M00142 M00162 M00346 M00415 M00538 M00567 M00613 M00639 M00705 M00778 M00800 M00831 M00852 M00879 M00903 M00926 M00946 M00968
M00015 M00035 M00057 M00078 M00100 M00122 M00143 M00163 M00356 M00416 M00539 M00568 M00614 M00641 M00714 M00779 M00801 M00832 M00853 M00880 M00904 M00927 M00947 M00969
M00016 M00036 M00058 M00079 M00101 M00123 M00144 M00165 M00357 M00417 M00540 M00569 M00615 M00642 M00718 M00780 M00802 M00833 M00854 M00881 M00905 M00928 M00948 M00970
M00017 M00037 M00059 M00081 M00102 M00124 M00145 M00166 M00358 M00418 M00541 M00570 M00616 M00643 M00725 M00781 M00803 M00834 M00855 M00882 M00906 M00929 M00949 M00971
M00018 M00038 M00060 M00082 M00103 M00125 M00146 M00167 M00363 M00419 M00542 M00572 M00617 M00649 M00726 M00782 M00804 M00835 M00856 M00883 M00909 M00930 M00950 M00972
M00019 M00039 M00061 M00083 M00104 M00126 M00147 M00168 M00364 M00422 M00543 M00573 M00618 M00651 M00730 M00783 M00805 M00836 M00857 M00884 M00910 M00931 M00951 M00973
M00020 M00040 M00063 M00085 M00106 M00127 M00148 M00169 M00365 M00432 M00544 M00574 M00620 M00652 M00736 M00784 M00808 M00837 M00859 M00889 M00911 M00932 M00952

KEGG/orphan_data:
02_hmm_profiles_with_ko_fams_with_no_threshold.hmm

It took 1:36:44 for the setup program to download everything from KEGG, using 6 threads. Most of that time is spent on the modeling data because of the sheer number of KEGG Orthology files; KOfams + modules take only about 12 minutes. (Also, for comparison's sake: when I was using 5 threads yesterday, it took 2:24:12 for everything. So adding that extra thread really helps a lot.)
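For anyone who wants the two timings side by side, here is the arithmetic as a small sketch (parse_hms is a helper written for this example). Each number comes from a single run on a different day, so this is a rough comparison rather than a benchmark.

```python
from datetime import timedelta

def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

six_threads = parse_hms("1:36:44")
five_threads = parse_hms("2:24:12")

saved = five_threads - six_threads     # 0:47:28
speedup = five_threads / six_threads   # about 1.49

if __name__ == "__main__":
    print(f"time saved: {saved}, speedup: {speedup:.2f}x")
```

A 1.49x speedup is more than the 1.2x that going from 5 to 6 threads would give on its own, so network or server conditions may also have differed between the two runs.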

The uncompressed KEGG data directory is 13 GB in size, and the majority of that is due to the HMMs:

iva$ du -h KEGG
13G	KEGG/HMMs
8.9M	KEGG/BRITE
338M	KEGG/orphan_data
1.9M	KEGG/modules
228M	KEGG/KO_REACTION_NETWORK
 13G	KEGG

And the archive file is 5.2 GB. I uploaded that archive to Figshare, and made it the default snapshot for anvi'o v8.

If the archive is already on your computer and you set it up with anvi-setup-kegg-data --kegg-archive KEGG_build_2023-09-22_a2b5bde358bb.tar.gz, the setup takes only 0:01:26. Running the default setup (which includes downloading this archive) takes 0:18:56.

@semiller10 , please let me know if anything is missing from this snapshot. we can still update it :)

ivagljiva · 2023-09-24T12:09:29Z

(can we please merge this so that it goes into v8? :)
@meren @semiller10

meren · 2023-09-24T12:12:40Z

Yes! I'm very happy if it is merged :) It is up to you and @semiller10 at this point :p

semiller10 · 2023-09-24T18:08:38Z

Doing a few final checks on my metabolism branch and will merge this one into that one and then that one into master within the next six hours or so. Thanks, @ivagljiva and @meren

ivagljiva added 30 commits August 29, 2023 13:49

Merge remote-tracking branch 'origin/metabolic-network-storage' into …

cd9aa86

…kegg_download_consolidation

add mode argument for selecting data to download

1f67986

the most simple refactor: conditionally run KEGG setup or refdbs setup

6b09101

a little fix for people whose PROTEIN_DATA dir doesn't exist yet

8df990b

update description

2d5181a

update copyright to 2023

f0b401e

rename setup program

bd1eef2

make kofam setup independent of the rest

f7d5d56

add kofam mode for downloading

3b3f49a

enforce sanity check for directory creation

6b9e1ff

make class specific to KOfam download mode

a140ba5

make db check function more generic

e48dbd7

we will move these sanity checks to individual subclasses

8bce614

skip_init param that will be accessible in args

eebf79f

make subclass specific to modules download

c0404d6

utilize the new subclasses in setup program

f52d260

add missing lambda func and remove stray space

bb32a87

we actually need archive path attribute in parent class

bedcff2

little init fixies

144ca5c

now this dict stores the modes

0b38947

use ozcan's hack for subparser parameters

3cacc38

section for mode specific params

8c5963a

a bit of curation of parameter help

4e25621

clarify dir param

858fe37

copy pasta error OOPS

a87a710

turns out we need this too

e921d3f

a little fixy for kofam --only-database

6b7baab

add debug output for db check

498bf49

rename only-database arg to only-processing

4e82c5d

switch to sam's newer class for modeling data download

d5e9e51

ivagljiva added 3 commits September 21, 2023 16:29

rename anvi-setup-protein-reference-database to anvi-setup-modelseed-…

e50000f

…database

Merge branch 'master' into kegg_download_consolidation

4b490d3

most recent snapshot has no modeling data associated with it

5b4dcdc

ivagljiva requested a review from semiller10 September 21, 2023 15:20

ivagljiva self-assigned this Sep 21, 2023

ivagljiva added 6 commits September 22, 2023 10:21

don't reset twice in 'all' mode

0a36d42

specify KO database in output

9bf9bdc

remove references to mode parameter wherever possible in documentatio…

9c80faa

…n. we want to encourage people to use the default mode (or read the docs if they really don't want all data so that they can take responsibility for what happens)

remove references to REACTION and COMPOUND databases in documentation

7868c64

whoops, forgot this reference to mode param

f693593

update instructions for adding new KEGG snapshot

ab0b26d

add new default snapshot for anvi'o v8 (now includes modeling data)

6ab82cc

make --reset work for snapshots and archives

423fe1f

semiller10 added 4 commits September 24, 2023 15:48

class variable of expected files

d9e1565

correct docstring

a3bf986

aesthetic

50d543a

reference expected files from KODatabase

fb10435

semiller10 merged commit 66d6493 into metabolic-network-storage Sep 25, 2023

ivagljiva deleted the kegg_download_consolidation branch September 25, 2023 07:27
