Consolidating our KEGG data download #2121
…kegg_download_consolidation
@semiller10, when you get a chance, could you please look over the changes here and make sure it all looks okay and plays nicely with the current version of your modeling code? If it is all good, it would be amazing if we could merge this into your branch before you merge your branch into Please let me know if there is anything I can clarify or fix. Thanks!
Thank you very much, @ivagljiva, the code, testing, and explanation are great. It looks like I might want to change a couple things, which I'll explain in response to your message. I certainly want to include this in
The reaction network no longer uses the KEGG REACTION database, which turned out to be redundant with the KO and ModelSEED databases. I'll have to update the snapshot. I think there should be options to download KEGG REACTION and KEGG COMPOUND -- I had code to download the latter database in the past before I decided that I didn't need it.
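For the optional KEGG REACTION and KEGG COMPOUND downloads mentioned here, a minimal sketch of what such a helper might look like, assuming the public KEGG REST `list` operation is used. The function names are hypothetical and are not taken from the anvi'o codebase.

```python
from urllib.request import urlopen

# Base URL of the public KEGG REST service.
KEGG_REST = "https://rest.kegg.jp"


def kegg_list_url(database: str) -> str:
    """Build the URL that lists all entries of a KEGG database."""
    return f"{KEGG_REST}/list/{database}"


def parse_kegg_list(text: str) -> dict:
    """Parse tab-delimited 'ID<TAB>description' lines into a dictionary."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry_id, _, description = line.partition("\t")
        entries[entry_id] = description
    return entries


def fetch_kegg_list(database: str) -> dict:
    """Download and parse the entry list (hypothetical helper, needs network)."""
    with urlopen(kegg_list_url(database)) as response:
        return parse_kegg_list(response.read().decode("utf-8"))
```

Under these assumptions, `fetch_kegg_list("reaction")` and `fetch_kegg_list("compound")` would each return one dictionary per database, so the two downloads could sit behind separate options.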
I'll update the |
Thanks @semiller10 ! First, my final test was a success so I want to proceed with creating the snapshot for v8. Second, a few questions related to your comments:
Do you mean that you no longer use the database created by the Here is what gets put into the folder for modeling data at
Just four files (and no reference to KEGG REACTION), so I think we are good there? My mistake was that I updated the documentation (and wrote the PR lol) based on my earlier understanding of what was being downloaded, without actually checking to see what really was being downloaded. My bad. But if this is the case, then we just need to update the documentation to accurately list what we are downloading, and there should be no need to update the snapshot (which doesn't exist yet, but I will start making it on the chance that this indeed is the correct data) :)
I have some generic functions for downloading KEGG stuff - namely, I also initially was planning to implement a
This is an excellent solution, thank you! |
OH DEAR. One big problem that I just noticed: it seems that the KOfam profiles (downloaded first) somehow get overwritten by the later downloads. There is only modules data and the reaction network data in the KEGG data dir. I will fix it 😅 UPDATE: It is because of the |
…n. we want to encourage people to use the default mode (or read the docs if they really don't want all data so that they can take responsibility for what happens)
Okay, now things appear to be working even with the |
EDIT: (IGNORE THIS) I currently only see this file referenced in
and in
(and it is also expected to be at the top-level by I will change the |
NEVER MIND that last comment. The two files have the same name, but are in fact different files.
We will keep both! Sorry for the chaos and any heart attacks that were caused by the suggestion to delete one of these files.
Snapshot and a few stats

These are the contents of the KEGG data directory after running the setup script with the
It took The uncompressed KEGG data directory is 13 GB in size, and the majority of that is due to the HMMs:
And the archive file is 5.2 GB. I uploaded that archive to Figshare, and made it the default snapshot for anvi'o v8. If the archive is already on your computer and you set it up with @semiller10, please let me know if anything is missing from this snapshot. We can still update it :)
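To check whether anything is missing from an unpacked snapshot, something along these lines could work. This is only a sketch: the `no_modeling_data` flag and the `KO_REACTION_NETWORK` directory name come from the PR description, while `HMMs` and `modules` are hypothetical stand-ins for the real entries, and none of this is the actual `kegg_archive_is_ok()` implementation.

```python
import tempfile
from pathlib import Path

# Hypothetical entry names, used only for illustration; the real layout of
# the KEGG data directory is not reproduced here.
BASE_ENTRIES = {"HMMs", "modules"}
MODELING_ENTRIES = {"KO_REACTION_NETWORK"}


def expected_entries(snapshot: dict) -> set:
    """Return the top-level entries a snapshot is expected to contain."""
    expected = set(BASE_ENTRIES)
    # Older snapshots are marked with `no_modeling_data: True`.
    if not snapshot.get("no_modeling_data", False):
        expected |= MODELING_ENTRIES
    return expected


def missing_entries(data_dir: Path, snapshot: dict) -> list:
    """List expected entries that are absent from an unpacked archive."""
    present = {path.name for path in data_dir.iterdir()}
    return sorted(expected_entries(snapshot) - present)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "HMMs").mkdir()
        (root / "modules").mkdir()
        # An old-style snapshot without modeling data is complete ...
        print(missing_entries(root, {"no_modeling_data": True}))
        # ... while a new-style snapshot would be flagged as incomplete.
        print(missing_entries(root, {}))
```

Keeping the expected entries in one place is one way to avoid the two checks drifting apart, which is the sync concern raised in the description.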
(can we please merge this so that it goes into v8? :) |
Yes! I'm very happy if it is merged :) It is up to you and @semiller10 at this point :p |
Doing a few final checks on my metabolism branch and will merge this one into that one and then that one into master within the next six hours or so. Thanks, @ivagljiva and @meren |
This PR consolidates the KEGG data downloading programs `anvi-setup-protein-reference-database` and `anvi-setup-kegg-kofams` into one program called `anvi-setup-kegg-data`. This program has various modes for independent download of different subsets of KEGG data (currently: KOfam, modules (KEGG MODULE + BRITE hierarchies => modules database), and modeling (KEGG Orthology + KEGG REACTION => table for metabolic modeling)). There is also an `all` mode that downloads all of the above :)

The default behavior of this program is to download a snapshot of the KEGG data directory that was already set up and processed and put on Figshare. This is similar to the previous behavior of `anvi-setup-kegg-kofams`, except that now the snapshots will include the modeling data as well.

Here is a summary of the most important changes and caveats:
- … `anvio/anvio/data/MISC/KEGG/KO_REACTION_NETWORK`. @semiller10 you could change this directory name if you wanted :)
- … `KOfamDownload` and `ModulesDownload`
- … `anvi-setup-protein-reference-database`
- … `anvi-run-kegg-kofams` and `anvi-estimate-metabolism`, since now KOfams can be downloaded without associated modules data
- … `no_modeling_data: True` lines to each of the current snapshots in `KEGG-SNAPSHOTS.yaml`.
- … `reactionnetwork.KODatabase` to avoid fragmenting @semiller10's modeling code too much. This class is simply called from `anvi-setup-kegg-data`.
- … `kegg.KeggSetup.kegg_archive_is_ok()` function rather than being smartly loaded from the `reactionnetwork.KODatabase` class. We'll have to keep these two in sync unless we find a nicer solution
- … `anvi-cluster-contigs`
- … `--mode all`. So, for example, you cannot use the parameter `--skip-brite-hierarchies` with `--mode all`. I think it's probably fine.
- … `anvi-setup-protein-reference-database` and that script has been renamed to `anvi-setup-modelseed-database`
- … `anvi-setup-kegg-data --mode all` (with the latest snapshot)

I haven't exhaustively tested all possible parameter combinations, but it seems to be working. All independent modes of download have worked on my computer, as well as downloading the latest snapshot (without modeling data). I am in the process of testing the `--mode all --download-from-kegg` combination and it is taking forever since the modeling data is HUGE, but I believe it will work in the end because that mode simply runs the independent downloads one after another (and those are fine).

Once this final download test has finished, I will use it to create a new KEGG snapshot that includes the modeling data, and add it to the `KEGG-SNAPSHOTS.yaml` file. I am hoping that this will be the new default snapshot associated with anvi'o v8 :)
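The mode behavior described above (independent modes, an `all` mode that runs them one after another, and mode-specific parameters being rejected with `--mode all`) can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the step functions are placeholders, the exact spelling of the mode values is assumed, and `--mode` is made required here only to avoid guessing at a default.

```python
import argparse


# Placeholder download steps; each stands in for one independent mode and
# only records that it ran.
def download_kofam(log):
    log.append("KOfam")


def download_modules(log):
    log.append("modules")


def download_modeling(log):
    log.append("modeling")


# Insertion order defines the order in which `all` runs the steps.
MODES = {
    "KOfam": download_kofam,
    "modules": download_modules,
    "modeling": download_modeling,
}


def build_parser():
    parser = argparse.ArgumentParser(prog="setup-sketch")
    parser.add_argument("--mode", choices=[*MODES, "all"], required=True)
    parser.add_argument("--skip-brite-hierarchies", action="store_true")
    return parser


def run(argv):
    """Parse arguments, validate the combination, and run the chosen steps."""
    args = build_parser().parse_args(argv)
    # Mode-specific parameters are not accepted together with `all`.
    if args.mode == "all" and args.skip_brite_hierarchies:
        raise ValueError("--skip-brite-hierarchies cannot be used with --mode all")
    steps = list(MODES.values()) if args.mode == "all" else [MODES[args.mode]]
    log = []
    for step in steps:
        step(log)
    return log
```

With this layout, `run(["--mode", "all"])` goes through the same functions as the individual modes, which is why a passing test of each independent mode is a reasonable indicator for the combined run.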