Add uploads to S3 for NCBI and Andersen Lab ingests #47

Merged 9 commits on Jun 3, 2024

@joverlee521 joverlee521 commented Jun 1, 2024

Description of proposed changes

Adds ability to upload NCBI and Andersen Lab ingest outputs to the public S3 bucket. See commits for details.

Uploaded the latest ingest outputs by running:

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    . \
        upload_all_ncbi \
            --configfile build-configs/ncbi/defaults/config.yaml

This uploaded files to

  • s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi
  • s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab

Then ran the h5n1-cattle-outbreak build with

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    . \
        --snakefile Snakefile.genome \
        --config s3_src=s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi

Related issue(s)

Resolves #41

Checklist

  • Checks pass

All three data sources go through the same process of merging
metadata per segment into a single metadata TSV. Deduplicate the
rules and make sure we are merging the data the same way so that
the final output metadata has the same format across all data sources.

I suspect the process for merging the metadata will grow over time as we
add QC checks, so I've pulled this out into a completely separate rules
file.
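
The per-segment merge described above could look roughly like the following. This is a minimal sketch, not the workflow's actual rule: the `strain` ID column, the `segments` and `n_segments` output columns, and the function name are all assumptions made for illustration.

```python
import csv
import io


def merge_segment_metadata(segment_tsvs, id_field="strain"):
    """Merge per-segment metadata TSV texts into one record per strain.

    `segment_tsvs` maps a segment name (e.g. "ha") to the TSV text for
    that segment. The column names used here are hypothetical.
    """
    merged = {}
    for segment, tsv_text in segment_tsvs.items():
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        for row in reader:
            # Keep the fields from the first segment that mentions a strain.
            record = merged.setdefault(row[id_field], dict(row))
            # Track which segments were seen for this strain.
            record.setdefault("segments", []).append(segment)

    for record in merged.values():
        record["n_segments"] = len(record["segments"])
        record["segments"] = ",".join(sorted(record["segments"]))
    return list(merged.values())
```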

Use wildcards to make the upload rules data source agnostic.
Nest each data source's wildcard name under the `s3_dst` config param
so that it's easier to upload to different S3 URLs. This will allow us
to run NCBI and Andersen lab ingests in parallel when we eventually
want to join their data.
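
To illustrate the nested `s3_dst` layout, here is a small sketch of how a data source wildcard might be resolved to its upload URL. The config keys `ncbi` and `andersen-lab`, the helper names, and the example filename are assumptions; only the two S3 URLs come from this pull request.

```python
def resolve_s3_dst(config, data_source):
    """Look up the S3 destination configured for one data source."""
    try:
        base = config["s3_dst"][data_source]
    except KeyError:
        raise ValueError(f"No s3_dst configured for data source {data_source!r}")
    return base.rstrip("/")


def upload_target(config, data_source, filename):
    """Build the full S3 URL for a file of the given data source."""
    return f"{resolve_s3_dst(config, data_source)}/{filename}"


# Hypothetical config: each data source gets its own destination URL.
example_config = {
    "s3_dst": {
        "ncbi": "s3://nextstrain-data/files/workflows/avian-flu/h5n1/ncbi",
        "andersen-lab": "s3://nextstrain-data/files/workflows/avian-flu/h5n1/andersen-lab",
    }
}
```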

We are planning to run this in parallel with the NCBI ingest and
eventually merge their data. They will also share more config
params with each other than with the default fauna ingest.

Andersen Lab is using NCBI SRA data, so I think it makes sense to be
under the NCBI umbrella.

Use the `segments` param to determine the default outputs for
`ingest_ncbi` to match the `merge_segment_metadata` rule.

Adds a sanity check that the requested segments are represented in
the `ncbi_segments` map.
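
The sanity check could be sketched as below. The contents of `EXAMPLE_NCBI_SEGMENTS` (the conventional influenza A segment numbering) and the function name are illustrative assumptions, not the workflow's actual config or code.

```python
def check_segments(segments, ncbi_segments):
    """Fail early if a requested segment is missing from the segment map."""
    missing = [segment for segment in segments if segment not in ncbi_segments]
    if missing:
        raise ValueError(
            "Requested segments not found in ncbi_segments: " + ", ".join(missing)
        )
    return list(segments)


# Illustrative map from segment name to segment number.
EXAMPLE_NCBI_SEGMENTS = {
    "pb2": "1", "pb1": "2", "pa": "3", "ha": "4",
    "np": "5", "na": "6", "mp": "7", "ns": "8",
}
```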

The new target `upload_all_ncbi` will run and upload all files for
the NCBI and Andersen lab ingests.

I didn't see a need for individual rules to upload each data source
(i.e. upload_ncbi and upload_andersen_lab) but they can be added in the
future as needed.

Replace the `aws s3 cp` commands with the vendored/upload-to-s3 script
to use a couple of its built-in features:
- CloudFront invalidation
- add Metadata.sha256sum to be able to track file changes
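
The idea behind tracking changes with a stored SHA-256 can be sketched as follows. This is not the vendored script's implementation: the helper names are hypothetical, and the sketch assumes the remote object's `sha256sum` metadata value has already been fetched (or is `None` when the object does not exist).

```python
import hashlib


def sha256_of_chunks(chunks):
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a local file without reading it into memory all at once."""
    with open(path, "rb") as handle:
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def needs_upload(local_sha256, remote_sha256):
    """Upload only when the remote hash is absent or differs from the local one."""
    return remote_sha256 is None or remote_sha256 != local_sha256
```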

trvrb commented Jun 3, 2024

This is great @joverlee521! Everything looks good to me. I also tested and it's working for me as expected.

@joverlee521 joverlee521 merged commit 4eaba84 into master Jun 3, 2024
6 checks passed
@joverlee521 joverlee521 deleted the ingest-uploads branch June 3, 2024 23:01
Linked issue closed by this pull request: ingest: Upload NCBI/Andersen lab outputs to S3