-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[build] Selective downloads and streaming upload/download #106
Conversation
610b30c
to
011dffe
Compare
Hmm, one small roadblock is that fsspec requires Python 3.6 at a minimum, but this project claims support for 3.5+. Either we relax that support to 3.6 or make fsspec usage optional somehow (but this gets ugly and complicated quick). I'm very much leaning towards bumping to 3.6, but will sleep on it and poll the rest of the team. My original reason (back in 2018) for supporting 3.5 was that Ubuntu 16.04 LTS (Xenial) was only 2 years old, so pretty common still, and shipped with 3.5. It's now ~5 years old and almost out of the standard LTS window. The following LTS release, 18.04, shipped with 3.6. There are many new standard library and language features starting with 3.6, which would be nice to avail ourselves of in this codebase! |
One more (hopefully small) roadblock from fsspec: It seems to call the rough equivalent of
|
011dffe
to
07521dc
Compare
The flag is |
Motivated by an upcoming dependency on fsspec which is 3.6+ and which would incur more overhead than I'd like to make optional. There are also many new standard library and language features starting with 3.6, which would be nice to avail ourselves of in this codebase! My original reason (back in 2018) for supporting 3.5 was that Ubuntu 16.04 LTS (Xenial) was only 2 years old, so pretty common still, and shipped with 3.5. It's now ~5 years old and almost out of the standard LTS window. The following LTS release, 18.04, shipped with 3.6. A few places in code which handled 3.5 vs. 3.6+ differences still remain, as they're not enough overhead to rip out (yet). This change warrants a major version bump for the first release including it.
…ration… …instead of after. Now you can see what large files are taking a moment to upload, instead of not knowing until they're complete!
…file This halves the local storage needed since no temporary file is involved and may also speed up the transfer of large builds since unmodified files do not need to be downloaded, just their metadata. The change also opens the door for selective downloading as well, by making use of S3's seekability via HTTP Range headers. fsspec encapsulates all these details for us into quite a nice abstraction! I expect to use it more in this project in the future; it's worked out well in ID3C. See also #83, although this sticks with zip archives.
Adds two new options to the `nextstrain build` command: --download <pattern> --no-download The former may be given multiple times and specifies patterns to match against build dir files which were modified by the remote build. The latter skips downloading results entirely, which is useful if all you care about are the logs (such as when re-attaching to a build or when a build uploads results itself elsewhere). The default is still to download every modified file. Currently limited to --aws-batch builds, as the only remote environment supported. This functionality will particularly help the ncov build shepherds, who often have the need to download only a few files from a very large build. Resolves #104. See also #83.
07521dc
to
e4dfe34
Compare
Repushed with initial commit to bump to 3.6: d7d8196 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, @tsibley. I dig the custom argparse action and the streaming read/write from S3 with fsspec. I always learn something when I review your code!
By the way, I ran into a problem with this where nextstrain-cli does not require |
@ttung Ah, apologies for the trouble! Thanks for pointing out the need to declare the dep on s3fs. The dep situation between boto3, botocore, fsspec, s3fs, and aiobotocore is a bit of hot mess (for example), so unfortunately it's not quite as simple as adding |
@ttung Released 3.0.1 just now which should (hopefully) address this issue. Thanks again! |
Description of proposed changes
❶ Support for selective downloading of remote build results
Adds two new options to the
nextstrain build
command:The former may be given multiple times and specifies patterns to match against build dir files which were modified by the remote build. The latter skips downloading results entirely, which is useful if all you care about are the logs (such as when re-attaching to a build or when a build uploads results itself elsewhere). The default is still to download every modified file.
Currently limited to
--aws-batch
builds, as the only remote environment supported.This functionality will particularly help the ncov build shepherds, who often have the need to download only a few files from a very large build.
Resolves #104. See also #83.
❷ Stream uploads/downloads without using a temporary file.
This halves the local storage needed since no temporary file is involved and may also speed up the transfer of large builds since unmodified files do not need to be downloaded, just their metadata.
The change also opens the door for selective downloading as well, by making use of S3's seekability via HTTP Range headers. fsspec encapsulates all these details for us into quite a nice abstraction! I expect to use it more in this project in the future; it's worked out well in ID3C.
See also #83, although this sticks with zip archives.
❸ Print file upload/download messages before each operation…
…instead of after. Now you can see what large files are taking a moment to upload, instead of not knowing until they're complete!
Testing
I've manually tested all combinations of the new options and existing functionality.