Allow ncov-ingest to download from non-AWS remote file paths #307

Open
huddlej opened this issue May 13, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@huddlej
Contributor

huddlej commented May 13, 2022

Context

To support running ingest on Terra, we need to support downloading existing Nextclade alignments and metadata from remote storage other than AWS's S3.

Description

The workflow configuration YAML needs to support defining a "source" bucket with at least S3 or GS URIs. This means renaming the configuration variable s3_src to something more generic and modifying all of the workflow logic that is specific to downloading from S3 (e.g., the "download_from_s3" script).
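A minimal sketch of what scheme-agnostic download logic could look like, assuming the generic source setting holds an `s3://` or `gs://` URI prefix. The helper names (`parse_remote_uri`, `build_download_command`) and the example bucket are hypothetical and not part of ncov-ingest; only the `aws s3 cp` and `gsutil cp` commands themselves are real.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to the CLI invocation that copies a file.
# `aws s3 cp` and `gsutil cp` are existing commands; everything else is illustrative.
COPY_COMMANDS = {
    "s3": ["aws", "s3", "cp"],
    "gs": ["gsutil", "cp"],
}


def parse_remote_uri(uri):
    """Split a remote URI into (scheme, bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme not in COPY_COMMANDS:
        raise ValueError(
            f"Unsupported scheme in {uri!r}; expected one of {sorted(COPY_COMMANDS)}"
        )
    if not parsed.netloc:
        raise ValueError(f"No bucket name found in {uri!r}")
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


def build_download_command(src, filename, dest):
    """Return the argv list that downloads `filename` under `src` to `dest`."""
    scheme, bucket, prefix = parse_remote_uri(src)
    # Join prefix and filename without producing a double slash.
    key = f"{prefix.rstrip('/')}/{filename}" if prefix else filename
    return COPY_COMMANDS[scheme] + [f"{scheme}://{bucket}/{key}", dest]


if __name__ == "__main__":
    # Hypothetical bucket and file names, for illustration only.
    print(build_download_command("gs://example-bucket/ingest", "metadata.tsv.gz", "data/metadata.tsv.gz"))
```

Dispatching on the URI scheme keeps a single configuration value for both storage backends, so existing S3 configurations would keep working while GS URIs take the other branch.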

As part of this work, we will also need to modify the Pipfile used to populate the ncov-ingest Docker image by adding the Python bindings for Google Cloud Storage. See the Dockerfile for the docker-base image. Update: this should not be necessary; see @tsibley's comments below.

Examples

The modified ingest should continue to work with our production S3 buckets, but it should also work from GS buckets accessed through Terra.

See code in ncov for handling remote files.

@tsibley
Member

tsibley commented May 13, 2022

> As part of this work, we will also need to modify the Pipfile used to populate the ncov-ingest Docker image by adding the Python bindings for Google Cloud Storage. See the Dockerfile for the docker-base image.

This should not be necessary, as the nextstrain/ncov-ingest image is based on the nextstrain/base image. I believe the only reason the GCS Python bindings aren't available in the latest nextstrain/ncov-ingest image is that it predates (~3 Feb) the addition of the bindings to the nextstrain/base image (~11 Feb). I triggered an image update which is running now, and that should take care of bringing in the GCS bindings.

@tsibley
Member

tsibley commented May 13, 2022

Works in the latest image now:

$ docker image ls --digests nextstrain/ncov-ingest
REPOSITORY               TAG       DIGEST                                                                    IMAGE ID       CREATED          SIZE
nextstrain/ncov-ingest   latest    sha256:2bb78caa7dfc38703724a5a36fa280744256d53f18c18c6e371a6c4048c77b65   d150b94d8afd   20 minutes ago   2.36GB
nextstrain/ncov-ingest   <none>    sha256:63bd513524c1e71eb1ce60c1fd4ac90d21ba443a43865c60062b96bf433e7536   d2f7e3fef415   3 months ago     2.4GB

$ docker run --rm -it nextstrain/ncov-ingest@sha256:2bb78caa7dfc38703724a5a36fa280744256d53f18c18c6e371a6c4048c77b65 python3 -c 'from google.cloud import storage'
# 👍 no error

$ docker run --rm -it nextstrain/ncov-ingest@sha256:63bd513524c1e71eb1ce60c1fd4ac90d21ba443a43865c60062b96bf433e7536 python3 -c 'from google.cloud import storage'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'google'
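The same check can be done from Python without a traceback, which may be handy if the workflow should fail early with a clear message when run in an older image. This is only a sketch; the helper name `has_module` is hypothetical.

```python
import importlib.util


def has_module(name):
    """Return True if the dotted module name can be located in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec imports parent packages first and raises if one of them
        # (e.g. `google`) is missing entirely.
        return False


if __name__ == "__main__":
    print("google.cloud.storage available:", has_module("google.cloud.storage"))
```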

Status: Prioritized