102 - persist bundle metadata in separate table #185

ri-pandey · 2024-02-29T20:33:11Z

Description

Persist bundle metadata in separate table.

Related Issue(s)

Closes #102, possibly #37 as well.

Changes Made

List the main changes made in this PR. Be as specific as possible.

Checklist

Before submitting this PR, please make sure that:

Your code passes linting and coding style checks.
Documentation has been updated to reflect the changes.
You have reviewed your own code and resolved any merge conflicts.
You have requested a review from at least one team member.
Any relevant issue(s) have been linked to this PR.

Additional Information

Added a separate table (bundle) for bundle metadata. It is linked to dataset via dataset_id.
The Archive step now evaluates bundle metadata (path, checksum, etc.), creates the bundle record, and associates it with the corresponding dataset.
The Stage step now verifies that the checksum of the downloaded bundle matches the checksum that the Archive step persisted to the database (this was the requirement for ticket #37, "Explore storing checksum of bundles in the dataset object to reduce computation").
The Download step now creates a symlink for the bundle download.
All places in the code that used dataset.bundle_size now use the size from the new bundle table.
A new alias, bundle_alias, is used to obfuscate the actual location of the bundle on Slate-scratch. This is the alias provided to users who attempt to download bundles. I am not reusing stage_alias for this because the stage_alias directory symlinks to the top-level directory inside the dataset, instead of the dataset's directory. Reusing it would therefore leave the bundle inside the dataset whenever the staged dataset is downloaded.
Added new paths for staging bundles.
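The checksum verification described for the Stage step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names file_checksum, verify_bundle, and ChecksumMismatchError are hypothetical, and the hash algorithm is not stated here, so md5 is only a placeholder default.

```python
import hashlib
from pathlib import Path


class ChecksumMismatchError(Exception):
    """Hypothetical exception raised when a bundle fails verification."""


def file_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large bundles are never read into memory at once.
    # The algorithm is an assumption; use whichever one the Archive step persisted.
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path: Path, persisted_checksum: str, algorithm: str = "md5") -> None:
    # Compare the checksum of the downloaded bundle with the persisted value
    # and raise on mismatch instead of continuing with a corrupt bundle.
    actual = file_checksum(path, algorithm)
    if actual != persisted_checksum:
        raise ChecksumMismatchError(
            f"{path.name}: expected {persisted_checksum}, got {actual}"
        )
```

Raising an exception on mismatch (rather than returning a flag) keeps a failed verification from being silently ignored by the calling step.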

I have also added a new script (populate_bundles.py) that:

iterates through the currently archived datasets,
downloads them from the SDA,
verifies that the calculated checksum of each downloaded bundle is the same as its checksum retrieved from the SDA (using the hsi utility), and
if checksum validation passes, runs the sync_archived_bundles workflow on these datasets. The workflow runs the archive (which populates the bundle metadata in the bundle table), stage, validate, and setup_download steps on each of them, thus preparing them for download.
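The control flow of the script can be sketched as below. This is an assumption-laden outline, not the contents of populate_bundles.py: the four callables (download, local_checksum, remote_checksum, start_workflow) are hypothetical stand-ins for the SDA download, the checksum calculation, the hsi lookup, and the workflow trigger, and datasets are assumed to be dicts with an "id" key.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable


@dataclass
class PopulateResult:
    synced: list = field(default_factory=list)   # ids whose workflow was started
    failed: list = field(default_factory=list)   # (id, error message) pairs


def populate_bundles(
    datasets: Iterable[dict],
    download: Callable[[dict], Any],          # hypothetical: fetch bundle, return local path
    local_checksum: Callable[[Any], str],     # hypothetical: checksum of the downloaded file
    remote_checksum: Callable[[dict], str],   # hypothetical: checksum reported by the archive
    start_workflow: Callable[[dict], Any],    # hypothetical: kick off the sync workflow
) -> PopulateResult:
    result = PopulateResult()
    for dataset in datasets:
        try:
            bundle_path = download(dataset)
            if local_checksum(bundle_path) != remote_checksum(dataset):
                raise ValueError("checksum mismatch")
            # Only datasets that pass validation reach the workflow.
            start_workflow(dataset)
            result.synced.append(dataset["id"])
        except Exception as exc:
            # One bad dataset should not stop the rest from being processed.
            result.failed.append((dataset["id"], str(exc)))
    return result
```

Passing the side-effecting operations in as callables is only a choice made for this sketch; it lets the loop be exercised without access to the SDA.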

* use separate entity for bundle metadata; validate bundle checksum after download
* changes to accommodate new data model
* use persisted metadata instead of constructing bundle name
* created endpoints
* moved block down
* use Path
* use Path
* removed comments
* removed comments
* tested endpoints
* fixed access
* throw exception in case of checksum mismatch
* show bundle size
* seed data
* show download option
* minor fixes
* migration file
* cleanup
* removed testing seed code
* added logging
* removed logging
* delete datasets before seeding
* fixed button style
* re-added comment
* added logging
* removed workflow association from seeded data
* fixed duplicate dataset names; added static ids for seeding; updated association
* fixed path in seed data
* updated origin_path
* typo fix
* added logging
* delete datasets before seeding
* made relational not optional on owning side
* removed migration
* recreated migration
* removed hardcoded ids
* temporarily removed additional admins in seeding
* use .name
* WIP - testing
* edited seed data
* changed endpoint
* use /datasets endpoints
* tested archive step
* tested stage step
* cleanup
* WIP - testing
* use bundle alias for bundle downloads
* include bundle info
* worker side changes
* eslint fixes
* cleanup
* fixed download path
* updated bundle path
* cleanup
* cleanup
* removed logging
* moved into var
* retrieve bundle info on dataset page
* fixed text
* added script for bundle population
* added script/workflow bundle population
* fixed imports/paths; catch exceptions
* use _id
* added new script to ecosystem.config
* trimmed logging
* 102 - updated usages of bundle_size

---------

Co-authored-by: scadev <scadev@bioloop-dev1.sca.iu.edu>
Co-authored-by: scadev@colo25 <scadev@colo25.carbonate.uits.iu.edu>

ri-pandey · 2024-02-29T20:44:29Z

workers/workers/api.py

@@ -85,7 +86,7 @@ def request(self, method, url, *args, **kwargs):


 def str_to_int(d: dict, key: str):
-    d['du_size'] = utils.parse_number(d.get(key, None))
+    d[key] = utils.parse_number(d.get(key, None))


@deepakduggirala I am assuming this was not intentional (maybe it was, because size and du_size are effectively the same value?). Let me know otherwise.
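To illustrate the behavioral difference in the hunk above, here is a small self-contained sketch. parse_number below is a hypothetical stand-in for utils.parse_number (whose actual behavior is not shown in this PR), and str_to_int_old is just the pre-change version kept for comparison.

```python
def parse_number(value, default=None):
    # Hypothetical stand-in for utils.parse_number: convert to int,
    # falling back to a default when the value is missing or malformed.
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def str_to_int_old(d: dict, key: str):
    # Before: the parsed value was always written to 'du_size',
    # regardless of which key was requested.
    d['du_size'] = parse_number(d.get(key, None))


def str_to_int(d: dict, key: str):
    # After: the parsed value is written back to the key it was read from.
    d[key] = parse_number(d.get(key, None))


old_record = {'du_size': '100', 'size': '200'}
str_to_int_old(old_record, 'size')
print(old_record)  # {'du_size': 200, 'size': '200'}

new_record = {'du_size': '100', 'size': '200'}
str_to_int(new_record, 'size')
print(new_record)  # {'du_size': '100', 'size': 200}
```

With the old version, converting size clobbers du_size and leaves size as a string, which only goes unnoticed when the two values happen to be equal.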

ri-pandey and others added 3 commits February 29, 2024 14:40

102 bundle size usages (#183)

38e352a

* 102 - remove outdated usage
* use arg

---------

Co-authored-by: scadev <scadev@bioloop-dev1.sca.iu.edu>
Co-authored-by: scadev@colo25 <scadev@colo25.carbonate.uits.iu.edu>

updated docs (#184)

cc03f2f

ri-pandey requested a review from deepakduggirala February 29, 2024 20:35

ri-pandey self-assigned this Feb 29, 2024

ri-pandey commented Feb 29, 2024

ri-pandey requested review from charlesbrandt and ryanlong89 February 29, 2024 21:02

ryanlong89 approved these changes Feb 29, 2024

ri-pandey merged commit 51f9360 into main Feb 29, 2024

ri-pandey deleted the 102-main branch February 29, 2024 22:39
