102 - persist bundle metadata in separate table #185

Merged
merged 3 commits into from
Feb 29, 2024

ri-pandey
Contributor

@ri-pandey ri-pandey commented Feb 29, 2024

Description

Persist bundle metadata in separate table.

Related Issue(s)

Closes #102, possibly #37 as well.

Changes Made

List the main changes made in this PR. Be as specific as possible.

  • Feature added
  • Bug fixed
  • Code refactored
  • Documentation updated
  • Other changes: [describe]

Checklist

Before submitting this PR, please make sure that:

  • Your code passes linting and coding style checks.
  • Documentation has been updated to reflect the changes.
  • You have reviewed your own code and resolved any merge conflicts.
  • You have requested a review from at least one team member.
  • Any relevant issue(s) have been linked to this PR.

Additional Information

  • Added separate table for bundle metadata (bundle). This is linked to dataset via dataset_id.
  • Archive step is updated to evaluate bundle metadata (path, checksum, etc.), create the bundle record, and associate it with the corresponding dataset.
  • Stage step is updated to verify that the checksum of the downloaded bundle matches the checksum that the Archive step persisted to the database (this was the requirement for ticket #37, "Explore storing checksum of bundles in the dataset object to reduce computation").
  • Download step is updated to create a symlink for the bundle download.
  • All places in the code that were using dataset.bundle_size have been updated to use the size from the new bundle table.
  • A new alias, bundle_alias, is used to obfuscate the actual location of the bundle on Slate-scratch. This is the alias provided to users who attempt to download bundles. I am not reusing stage_alias for this because the stage_alias directory symlinks to the top-level directory inside the dataset rather than to the dataset's own directory, so with that approach downloading the staged dataset would end up including the bundle inside the dataset.
  • Added new paths for staging bundles
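
The checksum check described for the Stage step can be sketched roughly as follows. This is only an illustration, not the code from this PR: the names file_checksum, verify_bundle, and ChecksumMismatchError are hypothetical, and MD5 is an assumption, since the description does not say which algorithm is used.

```python
import hashlib
from pathlib import Path


class ChecksumMismatchError(Exception):
    """Hypothetical exception for a bundle that fails verification."""


def file_checksum(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are never fully loaded into memory.

    The default algorithm (md5) is an assumption, not taken from the PR.
    """
    digest = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_path: Path, persisted_checksum: str) -> None:
    """Raise if the downloaded bundle does not match the persisted checksum."""
    actual = file_checksum(bundle_path)
    if actual != persisted_checksum:
        raise ChecksumMismatchError(
            f"{bundle_path.name}: expected {persisted_checksum}, got {actual}"
        )
```

In this sketch, persisted_checksum stands for the value the Archive step stored in the bundle table; raising on mismatch mirrors the "throw exception in case of checksum mismatch" commit.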

I have also added a new script (populate_bundles.py) that:

  • iterates through the currently archived datasets,
  • downloads their bundles from the SDA,
  • verifies that the checksum calculated for each downloaded bundle matches the checksum retrieved from the SDA (using the hsi utility), and
  • if checksum validation passes, runs the sync_archived_bundles workflow on the dataset. The workflow runs the archive (which populates the bundle metadata in the bundle table), stage, validate, and setup_download steps, thus preparing the dataset for download.
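
The control flow of the script described above might look roughly like the sketch below. It is not the actual populate_bundles.py: every callable passed in (download, local_checksum, archive_checksum, start_workflow) is a hypothetical placeholder for the real SDA download, checksum calculation, hsi lookup, and sync_archived_bundles trigger, and the dataset dictionaries with an "id" key are an assumption.

```python
from typing import Callable, Iterable, List


def populate_bundles(
    datasets: Iterable[dict],
    download: Callable[[dict], str],          # placeholder: fetch bundle, return local path
    local_checksum: Callable[[str], str],     # placeholder: checksum of the downloaded file
    archive_checksum: Callable[[dict], str],  # placeholder: checksum reported by the archive
    start_workflow: Callable[[dict], None],   # placeholder: kick off the workflow
) -> List[int]:
    """Process each archived dataset; return ids that failed or errored."""
    failed = []
    for dataset in datasets:
        try:
            bundle_path = download(dataset)
            if local_checksum(bundle_path) != archive_checksum(dataset):
                # Checksums disagree: skip the workflow for this dataset.
                failed.append(dataset["id"])
                continue
            start_workflow(dataset)
        except Exception:
            # One bad dataset should not stop the remaining ones.
            failed.append(dataset["id"])
    return failed
```

Passing the steps in as callables keeps the sketch self-contained and makes the loop easy to exercise with stubs before pointing it at real archives.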

ri-pandey and others added 3 commits February 29, 2024 14:40
* use separate entity for bundle metadata; validate bundle checksum after download

* changes to accommodate new data model

* use persisted metadata instead of constructing bundle name

* created endpoints

* moved block down

* use Path

* use Path

* removed comments

* removed comments

* tested endpoints

* fixed access

* throw exception in case of checksum mismatch

* show bundle size

* seed data

* show download option

* minor fixes

* migration file

* cleanup

* removed testing seed code

* added logging

* removed logging

* delete datasets before seeding

* fixed button style

* re-added comment

* added logging

* removed workflow association from seeded data

* fixed duplicate dataset names; added static ids for seeding; updated association

* fixed path in seed data

* updated origin_path

* typo fix

* added logging

* delete datasets before seeding

* made relational not optional on owning side

* removed migration

* recreated migration

* removed hardcoded ids

* temporarily removed additional admins in seeding

* use .name

* WIP - testing

* edited seed data

* changed endpoint

* use /datasets endpoints

* tested archive step

* tested stage step

* cleanup

* WIP - testing

* use bundle alias for bundle downloads

* include bundle info

* worker side changes

* eslint fixes

* cleanup

* fixed download path

* updated bundle path

* cleanup

* cleanup

* removed logging

* moved into var

* retrieve bundle info on dataset page

* fixed text

* added script for bundle population

* added script/workflow bundle population

* fixed imports/paths; catch exceptions

* use _id

* added new script to ecosystem.config

* trimmed logging

* 102 - updated usages of bundle_size

---------

Co-authored-by: scadev <scadev@bioloop-dev1.sca.iu.edu>
Co-authored-by: scadev@colo25 <scadev@colo25.carbonate.uits.iu.edu>
* 102 - remove outdated usage

* use arg

---------

Co-authored-by: scadev <scadev@bioloop-dev1.sca.iu.edu>
Co-authored-by: scadev@colo25 <scadev@colo25.carbonate.uits.iu.edu>
@ri-pandey ri-pandey self-assigned this Feb 29, 2024
@@ -85,7 +86,7 @@ def request(self, method, url, *args, **kwargs):


 def str_to_int(d: dict, key: str):
-    d['du_size'] = utils.parse_number(d.get(key, None))
+    d[key] = utils.parse_number(d.get(key, None))
Contributor Author

@ri-pandey ri-pandey Feb 29, 2024

@deepakduggirala I am assuming this was not intentional (maybe it was, because size and du_size are effectively the same value?). Let me know otherwise.
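
To make the effect of the one-line change to str_to_int concrete, here is a small self-contained comparison. The parse_number function below is a hypothetical stand-in for utils.parse_number (its real behavior is not shown in this PR), and the _old/_new suffixes exist only for this illustration.

```python
def parse_number(value):
    """Hypothetical stand-in for utils.parse_number: int, or None if unparsable."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def str_to_int_old(d: dict, key: str):
    # Variant with the hardcoded target: always writes to 'du_size'.
    d['du_size'] = parse_number(d.get(key, None))


def str_to_int_new(d: dict, key: str):
    # Variant using the argument: converts the requested key in place.
    d[key] = parse_number(d.get(key, None))


record = {"size": "2048", "du_size": "1024"}

old = dict(record)
str_to_int_old(old, "size")
# 'size' stays a string and 'du_size' is overwritten with size's value.

new = dict(record)
str_to_int_new(new, "size")
# 'size' is converted; 'du_size' is left untouched.
```

When size and du_size hold the same value, the two variants differ only in which key ends up converted, which may be why the hardcoded key went unnoticed.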

@ri-pandey ri-pandey merged commit 51f9360 into main Feb 29, 2024
@ri-pandey ri-pandey deleted the 102-main branch February 29, 2024 22:39
Development

Successfully merging this pull request may close these issues.

Bundle download - Store metadata and resolve hard coding path issues
2 participants