S3 IO #42
Conversation
Going to go about this in a very MVP-oriented way: tarballs streamed in and out, because we know how to preserve file attributes in a widely-understood way by doing this. (Another existing system that describes a FS<->S3bucket mapping is the s3fs-fuse project; we could also work with something like that in the future. For now, we don't want to deal with fuse, large C programs of questionable portability, or anything that puts latency in the middle of a job; we prefer to locate that work predictably on the provisioning and teardown edges.) Adding a dependency on s3gof3r, which appears to be a good, solid readymade API to stream large files to s3. (In the future this API is something we might actually prefer to use via an exec path, so we can have it behave as an optional tool rather than bloating the main binary for folk who aren't interested in it... but we'll do direct binds for now; it's pretty tiny, anyway.) Signed-off-by: Eric Myhre <hash@exultant.us>
Signed-off-by: Eric Myhre <hash@exultant.us>
Added a URL parameter to the common output test specifications. Signed-off-by: Eric Myhre <hash@exultant.us>
Having these common tests is paying huge dividends. We should have more of them, to be sure, but even this far they're a huge enabler. There's a *lot* of identical code here between the s3 inputs and outputs. It's messy. This will be fixed soon: there's a heavy IO rework in progress on another branch which will involve refolding this code to address that issue -- so far it's consistently emerged that pairs of inputs and outputs have far more in common with each other than they do with their category. Signed-off-by: Eric Myhre <hash@exultant.us>
Using the URI scheme to indicate a "splay" mode causes the s3 input and output systems to save data under content-addressable hashes (prefixed by a selected path) -- this allows easily storing sets of data under a single URI. (A better test would shovel several datasets into the same prefix at once, and then make sure they're all uniquely still reachable... this should be worked into the common tests in the future.) Signed-off-by: Eric Myhre <hash@exultant.us>
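A minimal sketch of the splay naming idea, assuming a sha384 digest encoded with URL-safe base64 (the function name is illustrative; repeatr's real tree hash covers file attributes as well, not just raw bytes):

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
	"path"
)

// splayKey derives an object key from the data itself: a sha384 digest,
// URL-safe-base64 encoded, placed under the user-chosen prefix. Because
// keys are content-derived, any number of datasets can share one
// "s3+splay://bucket/prefix" URI without colliding.
func splayKey(prefix string, data []byte) string {
	sum := sha512.Sum384(data)
	return path.Join(prefix, base64.URLEncoding.EncodeToString(sum[:]))
}

func main() {
	// Different content yields different keys under the same prefix,
	// so one URI can hold arbitrarily many datasets.
	fmt.Println(splayKey("/assets/", []byte("dataset-1")))
	fmt.Println(splayKey("/assets/", []byte("dataset-2")))
}
```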
// verify total integrity
expectedTreeHash, err := base64.URLEncoding.DecodeString(i.spec.Hash)
if !bytes.Equal(actualTreeHash, expectedTreeHash) {
	done <- input.InputHashMismatchError.New("expected hash %q, got %q", i.spec.Hash, base64.URLEncoding.EncodeToString(actualTreeHash))
Am I correct in understanding that, to use this transport, you're required to:
- Only use s3 assets produced (or scanned) by repeatr's custom sha384 logic
- Preserve the custom sha384 along with the upload URL, rather than the s3-specific hash (md5)
For example, it sounds like the workflow to use existing s3 assets right now would be:
- Download the s3 asset with some other tool (probably the gof3r binary, actually)
- Scan it with repeatr
- Provide it to a formula, probably as a dir input
Both limitations should be fine for a demonstration; I want to make sure I fully understand.
Correct, AWS's use of md5 is... cough.
If you have any existing asset in s3, you can use the non-"splay" type of location to load it... though if you're using repeatr as a coherent system, that's probably going to be an unusual situation to be in.
But yes, we badly need a repeatr scan CLI command to make it to the tip of the todo heap.
This looks great! Two limitations I'd like to solve:
Fixing these two points should be easy without going down the rabbit hole. I think we should gun for flexibility & simplicity here by just saying "yeah, we don't know what our final destination for URIs might be; for now we're just going to concatenate".
From net/url:
And so it becomes trivial to parse the URI twice, in a safe & reasonable manner:
The above link is a sandbox example of doing just that -- calling Parse twice. A tad silly; KISS principle.
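A guess at what that sandbox example might have looked like, assuming compound schemes of the form base+modifier (parseTwice is a hypothetical name, not anything in the codebase):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseTwice parses once to read the compound scheme, then rewrites the
// scheme's modifier away and parses again, leaving a plain URL for the
// transport layer to consume.
func parseTwice(raw string) (mode string, u *url.URL, err error) {
	outer, err := url.Parse(raw)
	if err != nil {
		return "", nil, err
	}
	base := outer.Scheme
	if i := strings.Index(outer.Scheme, "+"); i >= 0 {
		base = outer.Scheme[:i]
		mode = outer.Scheme[i+1:]
	}
	// Second parse: same URI, with the modifier stripped from the scheme.
	u, err = url.Parse(base + strings.TrimPrefix(raw, outer.Scheme))
	return mode, u, err
}

func main() {
	mode, u, err := parseTwice("s3+splay://my-bucket/prefix/")
	if err != nil {
		panic(err)
	}
	fmt.Println(mode, u.Scheme, u.Host, u.Path)
}
```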
(Large chunks of this are just porting back in from the PR description: #42) Signed-off-by: Eric Myhre <hash@exultant.us>
These are valid things to worry about, but at the moment I also don't want to go too far down the road of writing code for things we don't have tests for yet. (Boy, it'd be cool to have an S3 test we can run against an in-memory store in our own sandbox, though.) Another alternative would be stuffing more parameters, like whether or not to use https, into query params. That's what JDBC stuff usually did, iirc.
Docs should stand much improved... I think I'm happy enough to consider merging this (expecting many more PRs to enhance it in the future).
Okay, that sounds reasonable for an MVP. I'll merge this as-is and we can revisit my comment when it's time to design these properly.
MVP on using s3 as a data silo.
S3 is a k/v store, not an immediate mapping to a filesystem (despite what the UI may seem to suggest) -- to map to a full filesystem we need more bits for file attributes, etc. This implementation uses tars as the metaphor for mapping filesystems onto s3. (Another option would be using something like https://github.com/s3fs-fuse/s3fs-fuse -- that project has its own metaphor for mapping filesystems onto s3. We decided not to use it for the MVP because it's a much more complicated system; with this tarball approach, we know we can localize all network work before and after the job, rather than have interesting issues with random latency on disk mid-job. http://dsl-wiki.cs.uchicago.edu/index.php/Performance_Comparison:Remote_Usage%2C_NFS%2C_S3-fuse%2C_EBS has some other interesting performance notes; there are no short answers to what's "fastest", apparently. We can make more/different implementations in the future.)
https://github.com/rlmcpherson/s3gof3r was used for the data shoveling and AWS auth signing; it appears to work really well.
Configuration of AWS secret tokens and keys is handled through environment variables for the time being. We will probably want to expand on this later, building a general secrets management system, for at least two reasons: env vars placed before the entire repeatr system do not have the greatest scope limitations, and also they're effectively a global (what if someone wants to use two different s3 accounts?). Works for now.
Includes a bunch of enhancements to the flexibility of the IO system shared test specs. Tests now accept parameters for URIs.
This also demos the use of different URI schemes to switch storage behavior. The "s3://" scheme will cause the rest of the URI location to be taken literally; the "s3+splay://" scheme will instead enable content-addressable naming of uploads, which means the same location string can be used as a prefix for as much data as you want, and it will DTRT. (Presuming this goes well, we should consider using URI schemes in the other transports too; this would also let us drop the current "type" field assertions and have transports check that they recognize URI schemes, which makes a lot more sense when we get to configurable/pluginable dispatches for the IO transport systems.) We've also talked about upgrading data location URIs to more complex structural data than strings can handle, but without any use cases or specific demos on the table yet, well... URIs still work well enough for jdbc after a decade or so, and we're not so different.
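The scheme-switching behavior could be sketched like so (silotype is a hypothetical name; the real dispatch lives in the IO transport code):

```go
package main

import (
	"fmt"
	"net/url"
)

// silotype maps a URI scheme to a storage behavior. With scheme-based
// dispatch like this, the "type" field assertions could eventually be
// dropped: each transport just declares which schemes it recognizes.
func silotype(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "s3":
		return "s3: location taken literally", nil
	case "s3+splay":
		return "s3: content-addressed under prefix", nil
	default:
		return "", fmt.Errorf("unrecognized scheme %q", u.Scheme)
	}
}

func main() {
	for _, raw := range []string{"s3://bucket/exact/path", "s3+splay://bucket/prefix/"} {
		kind, err := silotype(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(kind)
	}
}
```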
Setting up S3/AWS permissions models is not handled here. Those are arcana that should be managed by your organization, not repeatr.