
S3 IO #42

Merged 6 commits from s3 into master on May 14, 2015

Conversation

warpfork (Member)

MVP of using S3 as a data silo.

S3 is a k/v store, not a direct mapping to a filesystem (despite what the UI may seem to suggest) -- to map a full filesystem we need more bits for file attributes, etc. This implementation uses tars as the metaphor for mapping filesystems onto S3. (Another option would be something like https://github.com/s3fs-fuse/s3fs-fuse, which has its own metaphor for mapping filesystems onto S3. We decided not to use that for the MVP because it's a much more complicated system; with the tarball approach, we know we can localize all network work before and after the job, rather than have interesting issues with random latency on disk mid-job. http://dsl-wiki.cs.uchicago.edu/index.php/Performance_Comparison:Remote_Usage%2C_NFS%2C_S3-fuse%2C_EBS has some other interesting performance notes; there are no short answers to what's "fastest", apparently. We can make more/different implementations in the future.)

https://github.com/rlmcpherson/s3gof3r was used to do the data shoveling and the AWS auth/signing stuff; it appears to work really well.
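
For a sense of what that looks like, here's a minimal sketch of streaming a tarball up with s3gof3r (the bucket name, object key, and local tar file are purely illustrative; credentials come from the usual AWS environment variables, which is how this PR configures things):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/rlmcpherson/s3gof3r"
)

func main() {
	// Credentials come from the environment (AWS_ACCESS_KEY_ID, etc).
	keys, err := s3gof3r.EnvKeys()
	if err != nil {
		log.Fatal(err)
	}

	// Open a streaming multipart writer to a bucket/key (names illustrative).
	bucket := s3gof3r.New("", keys).Bucket("repeatr-test")
	w, err := bucket.PutWriter("test-9/example.tar", nil, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Shovel a locally produced tarball straight up to S3.
	tarball, err := os.Open("example.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer tarball.Close()
	if _, err := io.Copy(w, tarball); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil { // Close completes the multipart upload.
		log.Fatal(err)
	}
}
```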

Configuration of AWS secret tokens and keys is handled through environment variables for the time being. We will probably want to expand on this later by building a general secrets management system, for at least two reasons: env vars set ahead of the entire repeatr system don't have the greatest scope limitations, and they're effectively a global (what if someone wants to use two different S3 accounts?). Works for now.

Includes a bunch of enhancements to the flexibility of the IO system's shared test specs. Tests now accept URI parameters.

This also demos the use of different URI schemes to switch storage behavior. The "s3://" scheme causes the rest of the URI location to be taken literally; the "s3+splay://" scheme instead enables content-addressable naming of uploads, which means the same location string can be used as a prefix for as much data as you want, and it will DTRT. (Presuming this goes well, we should consider using URI schemes in the other transports too; this would also let us drop the current "type" field assertions and have transports check that they recognize URI schemes, which makes a lot more sense once we get to configurable/pluginable dispatch for the IO transport systems.) We've also talked about upgrading data location URIs to structural data more complex than strings can handle, but without any use cases or specific demos on the table yet, well... URIs have worked well enough for JDBC for a decade or so, and we're not so different.
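
To make the scheme switch concrete, here's a hedged sketch (not the PR's actual code; the resolveKey helper and the exact key layout are assumptions based on the description above):

```go
package s3example

import (
	"encoding/base64"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// resolveKey (hypothetical) maps a location URI plus a tree hash onto a bucket
// and object key.  "s3://" takes the path literally; "s3+splay://" treats the
// path as a prefix and appends a content-addressable name.
func resolveKey(locationURI string, treeHash []byte) (bucket, key string, err error) {
	u, err := url.Parse(locationURI)
	if err != nil {
		return "", "", err
	}
	bucket = u.Host // e.g. "repeatr-test"
	prefix := strings.TrimPrefix(u.Path, "/")
	switch u.Scheme {
	case "s3":
		key = prefix // literal object key
	case "s3+splay":
		// Content-addressable: the same prefix can hold any number of datasets.
		key = path.Join(prefix, base64.URLEncoding.EncodeToString(treeHash))
	default:
		return "", "", fmt.Errorf("unrecognized scheme %q", u.Scheme)
	}
	return bucket, key, nil
}
```

So "s3+splay://repeatr-test/test-9/rt-splay/heap/" plus two different tree hashes yields two distinct keys under the same prefix.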

Setting up S3/AWS permissions models is not handled in this. That is arcana that should be managed by your organization, not repeatr.

Going to go about this in a very MVP-oriented way: tarballs streamed in and out, because we know how to preserve file attributes in a widely-understood way by doing this.  (Another existing system that describes an FS<->S3-bucket mapping is the s3fs-fuse project; we could also work with something like that in the future.  For now, we don't want to deal with FUSE or large C programs of questionable portability, and rather than something that puts latency in the middle of a job, we prefer to locate it predictably at the provisioning and teardown edges.)

Adding a dependency on s3gof3r, which appears to be a good, solid, ready-made API to stream large files to S3.  (In the future this is something we might actually prefer to use via an exec path, so we can have it behave as an optional tool rather than bloating the main binary for folks who aren't interested in it... but we'll do direct binds for now; it's pretty tiny, anyway.)

Signed-off-by: Eric Myhre <hash@exultant.us>
Signed-off-by: Eric Myhre <hash@exultant.us>
Added a URL parameter to the common output test specifications.

Signed-off-by: Eric Myhre <hash@exultant.us>
Having these common tests is paying huge dividends.  We should have more of them, to be sure, but even this far it's a huge enabler.

There's a *lot* of identical code here between the s3 inputs and outputs.  It's messy.  This will be fixed soon: there's a heavy IO rework in progress in another branch which will involve refolding this code to address that issue -- so far it has consistently emerged that pairs of inputs and outputs have way more in common with each other than they do with their category.

Signed-off-by: Eric Myhre <hash@exultant.us>
Using the URI scheme to indicate a "splay" mode causes the s3 input and output systems to save data under content-addressable hashes (prefixed by a selected path) -- this allows easily storing sets of data under a single URI.

(A better test would shovel several datasets into the same prefix at once, and then make sure they're all uniquely still reachable... this should be worked into the common tests in the future.)

Signed-off-by: Eric Myhre <hash@exultant.us>
```go
// verify total integrity
expectedTreeHash, err := base64.URLEncoding.DecodeString(i.spec.Hash)
if !bytes.Equal(actualTreeHash, expectedTreeHash) {
	done <- input.InputHashMismatchError.New("expected hash %q, got %q", i.spec.Hash, base64.URLEncoding.EncodeToString(actualTreeHash))
}
```
Member

Am I correct in understanding that, to use this transport, you're required to:

  1. Only use S3 assets produced (or scanned) by repeatr's custom sha384 logic
  2. Preserve the custom sha384 along with the upload URL, rather than the S3-specific hash (md5)

For example, it sounds like the workflow to use existing s3 assets right now would be:

  1. Download the s3 asset with some other tool (probably the gof3r binary, actually)
  2. Scan it with repeatr
  3. Provide it to a formula, probably as a dir input

Both limitations should be fine for a demonstration; I want to make sure I fully understand.

Member Author

Correct, AWS's use of md5 is... cough.

If you have any existing asset in s3, you can use the non-"splay" type of location to load it... though if you're using repeatr as a coherent system, that's probably going to be an unusual situation to be in.

But yes, we badly need a repeatr scan cli command to make it to the tip of the todo heap.

@kofalt (Member) commented May 11, 2015

This looks great! Two limitations I'd like to solve:

  1. The URI omits the hostname -- so you can't target arbitrarily many S3 endpoints, third-party implementations, etc.
  2. The URI omits the transport -- so the S3 scheme isn't configurable (e.g. skipping SSL for an internal proxy).

Fixing these two points should be doable without undue effort or going down the rabbit hole. I think we should gun for flexibility & simplicity here by just saying "yeah, we don't know what our final destination for URIs might be; for now we're just going to concatenate".

Current:  s3+splay://repeatr-test/test-9/rt-splay/heap/
Proposed: s3+splay://https://s3.amazonaws.com/repeatr-test/test-9/rt-splay/heap/

From net/url:

The general form represented is:
scheme://[userinfo@]host/path[?query][#fragment]

And so it becomes trivial to parse the URI twice, in a safe & reasonable manner:

Temporary repeatr s3 type: s3+splay
Actual s3 address:         https://s3.amazonaws.com/repeatr-test/test-9/rt-splay/heap/
Bucket name:               repeatr-test
Store path:                test-9/rt-splay/heap/

Above link is a sandbox example of doing just that - calling Parse twice. A tad silly; KISS principle.
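
A quick sketch of that double Parse on the proposed form (names and error handling are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

func main() {
	raw := "s3+splay://https://s3.amazonaws.com/repeatr-test/test-9/rt-splay/heap/"

	// First parse: peel off the repeatr transport type ("s3+splay").
	outer, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	repeatrType := outer.Scheme

	// Second parse: everything after "<scheme>://" is a plain https URL.
	inner, err := url.Parse(strings.TrimPrefix(raw, repeatrType+"://"))
	if err != nil {
		panic(err)
	}

	// Bucket is the first path segment; the rest is the store path.
	segments := strings.SplitN(strings.TrimPrefix(inner.Path, "/"), "/", 2)
	bucket, storePath := segments[0], ""
	if len(segments) > 1 {
		storePath = segments[1]
	}

	fmt.Println(repeatrType) // s3+splay
	fmt.Println(inner.Host)  // s3.amazonaws.com
	fmt.Println(bucket)      // repeatr-test
	fmt.Println(storePath)   // test-9/rt-splay/heap/
}
```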

(Large chunks of this are just porting back in from the PR description: #42)

Signed-off-by: Eric Myhre <hash@exultant.us>
@warpfork (Member, Author)

These are valid things to worry about, but at the moment I don't want to go too far down the road of writing code for things we don't have tests for yet.

(Boy, it'd be cool to have an S3 test we can run against an in-memory store in our own sandbox, though.)

Another alternative would be stuffing more parameters, like whether or not to use https, into query params. That's what JDBC stuff usually did, iirc.
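
For instance (purely hypothetical parameter names, just sketching the shape of that alternative):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Hypothetical: tuck endpoint/transport options into query params,
	// JDBC-style, instead of growing the scheme or the host field.
	u, err := url.Parse("s3+splay://repeatr-test/test-9/rt-splay/heap/?endpoint=s3.internal.example.com&tls=false")
	if err != nil {
		panic(err)
	}
	q := u.Query()
	fmt.Println(u.Host)            // repeatr-test (the bucket)
	fmt.Println(q.Get("endpoint")) // s3.internal.example.com
	fmt.Println(q.Get("tls"))      // false
}
```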

@warpfork (Member, Author)

Docs should stand much improved... I think I'm happy enough to consider merging this (expecting many more PRs to enhance it in the future).

@kofalt (Member) commented May 14, 2015

Okay, that sounds reasonable for an MVP. I'll merge this as-is, and we can revisit my comment when it's time to design these properly.

kofalt added a commit that referenced this pull request on May 14, 2015
kofalt merged commit 38cd314 into master on May 14, 2015
kofalt deleted the s3 branch on May 14, 2015 at 22:01