
[build] Selective downloads and streaming upload/download #106

Merged: 4 commits from build/selective-download into master on Feb 11, 2021

Conversation

@tsibley (Member) commented Feb 3, 2021

Description of proposed changes

❶ Support for selective downloading of remote build results

Adds two new options to the nextstrain build command:

--download <pattern>
--no-download

The former may be given multiple times and specifies patterns to match against build dir files that were modified by the remote build. The latter skips downloading results entirely, which is useful if all you care about are the logs (such as when re-attaching to a build, or when a build itself uploads results elsewhere). The default is still to download every modified file.
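For example, a hypothetical invocation (the pattern syntax is assumed here to be shell-style globs):

  nextstrain build --aws-batch --download "auspice/*" --download "results/*.tsv" .

Under the hood, the filtering might look roughly like this minimal sketch (the select_downloads helper and the fnmatch-style glob semantics are illustrative assumptions, not necessarily the actual implementation):

    from fnmatch import fnmatch
    from typing import Iterable, List

    def select_downloads(modified_files: Iterable[str], patterns: List[str]) -> List[str]:
        # No --download patterns given: keep the default behaviour of
        # downloading every file the remote build modified.
        if not patterns:
            return list(modified_files)
        # Otherwise keep only the files matching at least one pattern.
        return [f for f in modified_files if any(fnmatch(f, p) for p in patterns)]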

Currently limited to --aws-batch builds, as the only remote environment supported.

This functionality will particularly help the ncov build shepherds, who often need to download only a few files from a very large build.

Resolves #104. See also #83.

❷ Stream uploads/downloads without using a temporary file.

This halves the local storage needed, since no temporary file is involved, and may also speed up the transfer of large builds: unmodified files do not need to be downloaded, only their metadata.

The change also opens the door for selective downloading, by making use of S3's seekability via HTTP Range headers. fsspec encapsulates all these details for us in quite a nice abstraction! I expect to use it more in this project in the future; it's worked out well in ID3C.

See also #83, although this sticks with zip archives.
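As a minimal sketch of the idea (the bucket and key below are hypothetical): because the remote file object is seekable, ZipFile can read the archive's central directory and each member's metadata without pulling down the entire archive:

    import fsspec
    from zipfile import ZipFile

    # Open the remote archive as a seekable file-like object; s3fs
    # translates seeks and partial reads into HTTP Range requests.
    with fsspec.open("s3://example-bucket/build-results.zip", "rb") as remote_file:
        with ZipFile(remote_file) as archive:
            for member in archive.infolist():
                # Name, size, and mtime come from the central directory
                # alone, without downloading the member's compressed data.
                print(member.filename, member.file_size)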

❸ Print file upload/download messages before each operation…

…instead of after. Now you can see which large files are taking a moment to upload, instead of not knowing until they're complete!
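A tiny runnable sketch of the ordering change (the file names and the upload stand-in are made up):

    import time

    def upload(path: str) -> None:
        time.sleep(1)  # stand-in for a slow network transfer

    for path in ["small.txt", "huge-alignment.fasta"]:
        # Announce the file *before* transferring it, so you can see
        # which file is currently in flight, not just which finished.
        print("uploading", path)
        upload(path)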

Testing

I've manually tested all combinations of the new options and existing functionality.

@tsibley (Member Author) commented Feb 3, 2021

Hmm, one small roadblock is that fsspec requires Python 3.6 at a minimum, but this project claims support for 3.5+. Either we relax that support to 3.6 or make fsspec usage optional somehow (but that gets ugly and complicated quickly). I'm very much leaning towards bumping to 3.6, but will sleep on it and poll the rest of the team.

My original reason (back in 2018) for supporting 3.5 was that Ubuntu 16.04 LTS (Xenial) was only 2 years old, so pretty common still, and shipped with 3.5. It's now ~5 years old and almost out of the standard LTS window. The following LTS release, 18.04, shipped with 3.6.

There are many new standard library and language features starting with 3.6, which would be nice to avail ourselves of in this codebase!

@tsibley (Member Author) commented Feb 4, 2021

One more (hopefully small) roadblock from fsspec: It seems to call the rough equivalent of mkdir -p when opening an S3 object for writing, which includes an (idempotent?) CreateBucket call. If the IAM user doesn't have permission to CreateBucket, an error is thrown. Hopefully there's a flag to avoid this. Full log below.

Nextstrain Run ID: 6a45821f-08bb-4fc6-acf5-92e68fe91cda
Uploading /home/tom/nextstrain/ncov-ingest to S3
Traceback (most recent call last):
  File "/home/tom/.local/lib/python3.6/site-packages/s3fs/core.py", line 473, in _mkdir
    await self.s3.create_bucket(**params)
  File "/home/tom/.local/lib/python3.6/site-packages/aiobotocore/client.py", line 154, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tom/nextstrain/ncov-ingest/../cli/bin/nextstrain", line 32, in <module>
    exit( main() )
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/__main__.py", line 10, in main
    return cli.run( argv[1:] )
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/__init__.py", line 35, in run
    return opts.__command__.run(opts)
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/command/build.py", line 153, in run
    return runner.run(opts, working_volume = opts.build, cpus = opts.cpus, memory = opts.memory)
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/runner/__init__.py", line 172, in run
    return opts.__runner__.run(opts, argv, working_volume = working_volume, extra_env = extra_env, cpus = cpus, memory = memory)
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/runner/aws_batch/__init__.py", line 122, in run
    remote_workdir = s3.upload_workdir(local_workdir, bucket, run_id)
  File "/home/tom/nextstrain/ncov-ingest/../cli/nextstrain/cli/runner/aws_batch/s3.py", line 67, in upload_workdir
    with fsspec.open(object_url(remote_workdir), "wb") as remote_file:
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/core.py", line 438, in open
    **kwargs
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/core.py", line 291, in open_files
    [fs.makedirs(parent, exist_ok=True) for parent in parents]
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/core.py", line 291, in <listcomp>
    [fs.makedirs(parent, exist_ok=True) for parent in parents]
  File "/home/tom/.local/lib/python3.6/site-packages/s3fs/core.py", line 488, in makedirs
    self.mkdir(path, create_parents=True)
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/home/tom/.local/lib/python3.6/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/home/tom/.local/lib/python3.6/site-packages/s3fs/core.py", line 477, in _mkdir
    raise translate_boto_error(e) from e
PermissionError: Access Denied

@tsibley (Member Author) commented Feb 4, 2021

"Hopefully there's a flag to avoid this."

The flag is auto_mkdir = False. Fixed with the repush.
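For reference, the shape of the fix is roughly this (the URL is hypothetical):

    import fsspec

    # auto_mkdir=False stops fsspec from "creating parent directories"
    # before a write, which s3fs maps onto a CreateBucket call that
    # fails for IAM users lacking that permission.
    with fsspec.open("s3://example-bucket/build-results.zip", "wb", auto_mkdir=False) as remote_file:
        remote_file.write(b"...")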

Commit message for the Python 3.6 bump (d7d8196):

    Motivated by an upcoming dependency on fsspec which is 3.6+ and which
    would incur more overhead than I'd like to make optional.

    There are also many new standard library and language features starting
    with 3.6, which would be nice to avail ourselves of in this codebase!

    My original reason (back in 2018) for supporting 3.5 was that Ubuntu
    16.04 LTS (Xenial) was only 2 years old, so pretty common still, and
    shipped with 3.5.  It's now ~5 years old and almost out of the standard
    LTS window.  The following LTS release, 18.04, shipped with 3.6.

    A few places in code which handled 3.5 vs. 3.6+ differences still
    remain, as they're not enough overhead to rip out (yet).

    This change warrants a major version bump for the first release
    including it.
@tsibley (Member Author) commented Feb 10, 2021

Repushed with initial commit to bump to 3.6: d7d8196

@huddlej (Contributor) left a comment


This looks good to me, @tsibley. I dig the custom argparse action and the streaming read/write from S3 with fsspec. I always learn something when I review your code!

@tsibley merged commit 789d4a9 into master on Feb 11, 2021
@tsibley deleted the build/selective-download branch on Feb 11, 2021
@ttung (Contributor) commented Feb 12, 2021

By the way, I ran into a problem with this where nextstrain-cli does not require s3fs, and as a result, fsspec crashes for AWS Batch jobs.

@tsibley (Member Author) commented Feb 12, 2021

@ttung Ah, apologies for the trouble! Thanks for pointing out the need to declare the dep on s3fs. The dep situation between boto3, botocore, fsspec, s3fs, and aiobotocore is a bit of a hot mess (for example), so unfortunately it's not quite as simple as adding s3fs or fsspec[s3] to our setup.py here. I'll work it out and cut a new release as soon as I can, though.
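To illustrate the shape of the problem (a sketch only, not the actual fix that went into the release): naively declaring both deps in setup.py can produce conflicting resolutions, because s3fs depends on aiobotocore, which pins exact botocore versions:

    # setup.py sketch; the package metadata here is hypothetical.
    from setuptools import setup

    setup(
        name="example-cli",
        version="0.0.0",
        install_requires=[
            "boto3",  # pulls in whatever botocore it wants...
            "s3fs",   # ...while s3fs -> aiobotocore pins an exact
                      # botocore, so the two can easily conflict.
        ],
    )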

@tsibley (Member Author) commented Feb 12, 2021

@ttung Released 3.0.1 just now which should (hopefully) address this issue. Thanks again!

Merging this pull request closes: Allow downloading of the individual output files of the AWS Batch job (#104)