Add uploads to S3 for NCBI and Andersen Lab ingests #47
Merged
All three data sources go through the same process of merging metadata per segment into a single metadata TSV. Deduplicate the rules and make sure we are merging the data the same way so that the final output metadata has the same format across all data sources. I suspect the process for merging the metadata will grow over time as we add QC checks, so I've pulled this out into a completely separate rules file.
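As a rough illustration of the merge described above, here is a minimal Python sketch. It assumes each segment's metadata is a TSV keyed by a `strain` column and that the merged output flags which segments each strain has. The function name, the `strain` key, and the `n_segments` column are hypothetical and are not taken from the actual rules file.

```python
import csv
import io


def merge_segment_metadata(segment_tsvs, id_column="strain"):
    """Merge per-segment metadata TSV text into one list of records.

    `segment_tsvs` maps a segment name (e.g. "ha") to the TSV text for
    that segment. Every name here is an assumption for illustration.
    """
    merged = {}
    for segment, text in segment_tsvs.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for row in reader:
            # Keep the first-seen metadata for a strain, then flag the segment.
            record = merged.setdefault(row[id_column], dict(row))
            record[segment] = "True"

    segments = list(segment_tsvs)
    for record in merged.values():
        # Give every record the same columns, regardless of data source.
        for segment in segments:
            record.setdefault(segment, "False")
        record["n_segments"] = str(sum(record[s] == "True" for s in segments))
    return list(merged.values())
```

Running every data source through one function like this is what keeps the output columns identical, which is the point of deduplicating the rules.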
Use wildcards to make the upload rules data source agnostic. Nest each data source's wildcard name under the `s3_dst` config param so that it's easier to upload to different S3 URLs. This will allow us to run NCBI and Andersen Lab ingests in parallel when we eventually want to join their data.
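To make the nesting concrete, here is a small Python sketch of how a destination could be looked up per data source. It assumes the wildcard values are `ncbi` and `andersen-lab` (matching the S3 URLs in the description) and that files are uploaded under their own names; the helper `upload_destination` and the file name `metadata.tsv` are hypothetical.

```python
# Hypothetical config: one S3 destination per data source wildcard value.
config = {
    "s3_dst": {
        "ncbi": "s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi",
        "andersen-lab": "s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab",
    }
}


def upload_destination(config, data_source, filename):
    """Return the full S3 URL for `filename` from the given data source."""
    try:
        base = config["s3_dst"][data_source]
    except KeyError:
        raise ValueError(f"No s3_dst configured for data source {data_source!r}") from None
    return f"{base.rstrip('/')}/{filename}"
```

With this shape, pointing a single data source at a different bucket only requires changing its entry under `s3_dst`.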
We are planning to run this in parallel with the NCBI ingest and eventually merge their data. They will also share more config params with each other than with the default fauna ingest. Andersen Lab is using NCBI SRA data, so I think it makes sense to be under the NCBI umbrella.
Use the `segments` param to determine the default outputs for `ingest_ncbi` to match the `merge_segment_metadata` rule. Adds a sanity check that the requested segments are represented in the `ncbi_segments` map.
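The sanity check could look roughly like the following Python sketch. The contents of `ncbi_segments` shown in the usage note (segment name mapped to segment number) and the output path pattern are assumptions for illustration, not the workflow's actual values.

```python
def default_ncbi_outputs(segments, ncbi_segments):
    """Validate the requested segments and list one output path per segment.

    The path pattern below is hypothetical.
    """
    missing = [segment for segment in segments if segment not in ncbi_segments]
    if missing:
        raise ValueError(
            "Requested segments not found in ncbi_segments: " + ", ".join(missing)
        )
    return [f"ncbi/results/sequences_{segment}.fasta" for segment in segments]
```

For example, `default_ncbi_outputs(["ha", "na"], {"ha": "4", "na": "6"})` returns two paths, while requesting a segment that is absent from the map fails early with a clear error instead of producing a partial ingest.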
The new target `upload_all_ncbi` will run and upload all files for the NCBI and Andersen Lab ingests. I didn't see a need for individual rules to upload each data source (i.e. `upload_ncbi` and `upload_andersen_lab`), but they can be added in the future as needed.
Replace the `aws s3 cp` commands with the vendored/upload-to-s3 script to use a couple of its built-in features:
- CloudFront invalidation
- add Metadata.sha256sum to be able to track file changes
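To show why a stored `sha256sum` helps track file changes, here is a minimal Python sketch of the idea. It is not the vendored script's implementation; the helper names are hypothetical, and it assumes the previously stored digest is available as a hex string (or `None` if the object does not exist yet).

```python
import hashlib


def sha256sum(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_upload(local_data: bytes, remote_sha256sum) -> bool:
    """Decide whether a file changed relative to the stored digest."""
    if remote_sha256sum is None:
        # Nothing has been uploaded yet, so there is nothing to compare.
        return True
    return sha256sum(local_data) != remote_sha256sum
```

Comparing digests rather than timestamps means an unchanged file can be recognized even when it was regenerated by a fresh ingest run.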
joverlee521 force-pushed the ingest-uploads branch from 65ac09f to e86aa97 on June 1, 2024 01:00
genehack approved these changes on Jun 3, 2024
This is great @joverlee521! Everything looks good to me. I also tested and it's working for me as expected.
Description of proposed changes
Adds ability to upload NCBI and Andersen Lab ingest outputs to the public S3 bucket. See commits for details.
Uploaded the latest ingest outputs by running:
This uploaded files to
s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab
Then ran the h5n1-cattle-outbreak build with
Related issue(s)
Resolves #41
Checklist