Support direct access to s3, etc., through URL schemes. #62

Open
mariusae opened this issue Jul 26, 2018 · 1 comment

@mariusae
Collaborator

Applications that can read directly from S3 or other cloud storage providers could run faster (sometimes by a lot) by skipping the staging step that is currently needed to make files available locally. However, this must be done in a way that still lets reflow track dependencies for cache key construction and invalidation. With S3, reflow could produce signed URLs so that credentials need not be plumbed through (and so that access to external resources is more tightly controlled). To guard against files changing, the URL should also include the assumed e-tag, and applications should try to honor it (e.g., by failing on an e-tag mismatch).
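
To make the e-tag idea concrete, here is a minimal sketch of a URL that carries the assumed e-tag and a check an application could run before trusting the data. Everything named here is an assumption for illustration: the `storage.example` host, the `etag` and `sig` query parameters, and the HMAC signing are hypothetical stand-ins, not S3's actual presigning scheme and not anything reflow implements.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical signing key; real S3 presigned URLs use a different scheme.
SECRET = b"demo-signing-key"


class ETagMismatch(Exception):
    """Raised when the object no longer matches the e-tag the URL assumed."""


def render_url(bucket: str, key: str, etag: str) -> str:
    """Render a signed URL that pins the e-tag assumed when it was created."""
    path = f"/{bucket}/{key}"
    query = urlencode({"etag": etag})
    sig = hmac.new(SECRET, f"{path}?{query}".encode(), hashlib.sha256).hexdigest()
    return f"https://storage.example{path}?{query}&sig={sig}"


def assumed_etag(url: str) -> str:
    """Extract the e-tag that the URL was rendered against."""
    return parse_qs(urlsplit(url).query)["etag"][0]


def check_etag(url: str, observed_etag: str) -> None:
    """Fail if the e-tag reported by the provider differs from the assumed one."""
    expected = assumed_etag(url)
    # ETag response headers are quoted strings, so strip the quotes first.
    if observed_etag.strip('"') != expected:
        raise ETagMismatch(f"expected e-tag {expected}, got {observed_etag}")
```

An application would call `check_etag(url, response_etag)` after fetching and abort on `ETagMismatch`. Where the provider supports conditional requests, sending the assumed e-tag in an `If-Match` header may achieve the same effect without a client-side comparison.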

There are a number of possible ways to provide this functionality.

  • We could provide a builtin function, url, which renders a signed URL from a file or directory where supported. This could return a tuple (string, bool) indicating whether a URL could be constructed.

  • Another possibility is an option on execs: exec(..., urls := ["s3", "gcs"]), naming the storage providers that the exec supports natively. If a file can be accessed directly through a supported storage provider, a URL is rendered instead of a local file path. If, for whatever reason, a URL cannot be rendered (e.g., signing failed, or the storage provider doesn't support external access), then the file (or directory) is staged and the file path is rendered instead.
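
The fallback behaviour of the second option could be sketched roughly as follows. This is only an illustration of the rule "render a URL when possible, otherwise stage": the `File` type, `render_arg`, and the `sign` and `stage` callbacks are hypothetical names, not part of reflow.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class File:
    # Remote origin such as "s3://bucket/key"; None if there is no remote origin.
    source: Optional[str]
    # Path the file would have after staging onto the local disk.
    local_path: str


def scheme_of(source: Optional[str]) -> Optional[str]:
    """Return the URL scheme of a source ("s3", "gcs", ...), if it has one."""
    if source and "://" in source:
        return source.split("://", 1)[0]
    return None


def render_arg(
    f: File,
    urls: Sequence[str],
    sign: Callable[[str], Optional[str]],
    stage: Callable[[File], str],
) -> str:
    """Render a signed URL when the provider is supported, else a staged path."""
    if scheme_of(f.source) in urls:
        signed = sign(f.source)  # returns None when signing is not possible
        if signed is not None:
            return signed
    # Unsupported provider or failed signing: stage and use the local path.
    return stage(f)
```

Here `urls` plays the role of the `urls := ["s3", "gcs"]` option, and staging only happens for files that could not be rendered as URLs.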

I think the second provides better ergonomics, though it poses some challenges if you want to mix URL and local access. We could also provide ways to override this.

@siddharthab

I can think of three conditions for the url builtin:

  1. The file is only available in a cloud bucket and not locally on the alloc, or in the reflow cache.
  2. The file is not available in a cloud bucket, because it is an output from some other exec.
    2.1. The file is available in the same alloc as this exec.
    2.2. The file is not available in the alloc, but is available in the reflow cache, so it can be accessed directly from there through a signed url.

I expect it to be rare for a remote URL-based file to already be available on the alloc because another exec was using it. But in such situations, if reflow does not handle it transparently, then option 1 will give users more control to optimize in the most common resource allocation and execution scenarios.
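
The three conditions listed above can be written down as a small classifier, which also makes the uncovered case (a bucket-backed file that is already on the alloc) explicit. This is a sketch only; the `Condition` enum and `classify` function are hypothetical names, and the three boolean inputs are assumed to be known to the caller.

```python
from enum import Enum
from typing import Optional


class Condition(Enum):
    BUCKET_ONLY = "1"    # only in a cloud bucket
    ON_ALLOC = "2.1"     # exec output, present on the same alloc
    IN_CACHE = "2.2"     # exec output, only in the reflow cache


def classify(in_bucket: bool, on_alloc: bool, in_cache: bool) -> Optional[Condition]:
    """Map a file's availability to one of the listed conditions, if any."""
    if in_bucket and not on_alloc and not in_cache:
        return Condition.BUCKET_ONLY
    if not in_bucket and on_alloc:
        return Condition.ON_ALLOC
    if not in_bucket and not on_alloc and in_cache:
        return Condition.IN_CACHE
    # Not covered by the list, e.g. a bucket file already present on the alloc.
    return None
```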


3 participants