S3 IO #42
Conversation
Going to go about this in a very MVP-oriented way: tarballs streamed in and out, because we know how to preserve file attributes in a widely-understood way by doing this. (Another existing system that describes a FS<->S3bucket mapping is the s3fs-fuse project; we could also work with something like that in the future. For now, we don't want to deal with fuse, large C programs of questionable portability, or anything that puts latency in the middle of a job; we prefer to locate that work predictably on the provisioning and teardown edges.) Adding a dependency on s3gof3r, which appears to be a good, solid readymade API to stream large files to s3. (In the future this API is something we might actually prefer to use via an exec path, so we can have it behave as an optional tool rather than bloating the main binary for folk who aren't interested in it... but we'll do direct binds for now; it's pretty tiny, anyway.) Signed-off-by: Eric Myhre <hash@exultant.us>
Signed-off-by: Eric Myhre <hash@exultant.us>
Added a URL parameter to the common output test specifications. Signed-off-by: Eric Myhre <hash@exultant.us>
Having these common tests is paying huge dividends. We should have more of them, to be sure, but even this far they're a huge enabler. There's a *lot* of identical code here between the s3 inputs and outputs. It's messy. This will be fixed soon: there's a heavy IO rework in progress on another branch which will involve refolding this code to address that issue -- so far it's consistently emerged that pairs of inputs and outputs have far more in common with each other than they do with their category. Signed-off-by: Eric Myhre <hash@exultant.us>
Using the URI scheme to indicate a "splay" mode causes the s3 input and output systems to save data under content-addressable hashes (prefixed by a selected path) -- this allows easily storing sets of data under a single URI. (A better test would shovel several datasets into the same prefix at once, and then make sure they're all uniquely still reachable... this should be worked into the common tests in the future.) Signed-off-by: Eric Myhre <hash@exultant.us>
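A minimal sketch of the splay naming idea, assuming a sha384 digest encoded with URL-safe base64 (the function name is illustrative; repeatr's real tree hash covers file attributes as well, not just raw bytes):

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"fmt"
	"path"
)

// splayKey derives an object key from the data itself: a sha384 digest,
// URL-safe-base64 encoded, placed under the user-chosen prefix. Because
// keys are content-derived, any number of datasets can share one
// "s3+splay://bucket/prefix" URI without colliding.
func splayKey(prefix string, data []byte) string {
	sum := sha512.Sum384(data)
	return path.Join(prefix, base64.URLEncoding.EncodeToString(sum[:]))
}

func main() {
	// Different content yields different keys under the same prefix,
	// so one URI can hold arbitrarily many datasets.
	fmt.Println(splayKey("/assets/", []byte("dataset-1")))
	fmt.Println(splayKey("/assets/", []byte("dataset-2")))
}
```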
// verify total integrity
expectedTreeHash, err := base64.URLEncoding.DecodeString(i.spec.Hash)
if !bytes.Equal(actualTreeHash, expectedTreeHash) {
	done <- input.InputHashMismatchError.New("expected hash %q, got %q", i.spec.Hash, base64.URLEncoding.EncodeToString(actualTreeHash))
Am I correct in understanding that, to use this transport, you're required to:
- Only use s3 assets produced (or scanned) by repeatr's custom sha384 logic
- Preserve the custom sha384 along with the upload URL, rather than the s3-specific hash (md5)
For example, it sounds like the workflow to use existing s3 assets right now would be:
- Download the s3 asset with some other tool (probably the gof3r binary, actually)
- Scan it with repeatr
- Provide it to a formula, probably as a dir input
Both limitations should be fine for a demonstration; I want to make sure I fully understand.
Correct, AWS's use of md5 is... cough.
If you have any existing asset in s3, you can use the non-"splay" type of location to load it... though if you're using repeatr as a coherent system, that's probably going to be an unusual situation to be in.
But yes, we badly need a repeatr scan CLI command to make it to the tip of the todo heap.
This looks great! Two limitations I'd like to solve:
Fixing these two points should be easy without going down the rabbit hole. I think we should gun for flexibility & simplicity here by just saying "yeah, we don't know what our final destination for URIs might be; for now we're just going to concatenate".
From net/url:
And so it becomes trivial to parse the URI twice, in a safe & reasonable manner:
The above link is a sandbox example of doing just that -- calling Parse twice. A tad silly; KISS principle.
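A guess at what that sandbox example might have looked like, assuming compound schemes of the form base+modifier (parseTwice is a hypothetical name, not anything in the codebase):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseTwice parses once to read the compound scheme, then rewrites the
// scheme's modifier away and parses again, leaving a plain URL for the
// transport layer to consume.
func parseTwice(raw string) (mode string, u *url.URL, err error) {
	outer, err := url.Parse(raw)
	if err != nil {
		return "", nil, err
	}
	base := outer.Scheme
	if i := strings.Index(outer.Scheme, "+"); i >= 0 {
		base = outer.Scheme[:i]
		mode = outer.Scheme[i+1:]
	}
	// Second parse: same URI, with the modifier stripped from the scheme.
	u, err = url.Parse(base + strings.TrimPrefix(raw, outer.Scheme))
	return mode, u, err
}

func main() {
	mode, u, err := parseTwice("s3+splay://my-bucket/prefix/")
	if err != nil {
		panic(err)
	}
	fmt.Println(mode, u.Scheme, u.Host, u.Path)
}
```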
(Large chunks of this are just porting back in from the PR description: #42) Signed-off-by: Eric Myhre <hash@exultant.us>
These are valid things to worry about, but at the moment I also don't want to go too far down the road of writing code for things we don't have tests for yet. (Boy, it'd be cool to have an S3 test we can run against an in-memory store in our own sandbox, though.) Another alternative would be stuffing more parameters, like whether or not to use https, into query params. That's what JDBC stuff usually did, iirc.
Docs should stand much improved... I think I'm happy enough to consider merging this (expecting many more PRs to enhance it in the future).
Okay, that sounds reasonable for an MVP. I'll merge this as-is and we can revisit my comment when it's time to design these properly.
MVP on using s3 as a data silo.
S3 is a k/v store, not an immediate mapping to a filesystem (despite what the UI may seem to suggest) -- to map to a full filesystem we need more bits for file attributes, etc. This implementation uses tars as the metaphor for mapping filesystems onto s3. (Another option would be using something like https://github.com/s3fs-fuse/s3fs-fuse -- that project has its own metaphor for mapping filesystems onto s3. We decided not to use it for the MVP because it's a much more complicated system; with this tarball approach, we know we can localize all network work before and after the job, rather than have interesting issues with random latency on disk mid-job. http://dsl-wiki.cs.uchicago.edu/index.php/Performance_Comparison:Remote_Usage%2C_NFS%2C_S3-fuse%2C_EBS has some other interesting performance notes; there are no short answers to what's "fastest", apparently. We can make more/different implementations in the future.)
https://github.com/rlmcpherson/s3gof3r was used for the data shoveling and AWS auth signing; it appears to work really well.
Configuration of AWS secret tokens and keys is handled through environment variables for the time being. We will probably want to expand on this later, building a general secrets management system, for at least two reasons: env vars placed before the entire repeatr system do not have the greatest scope limitations, and also they're effectively a global (what if someone wants to use two different s3 accounts?). Works for now.
Includes a bunch of enhancements to the flexibility of the IO system shared test specs. Tests now accept parameters for URIs.
This also demos the use of different URI schemes to switch storage behavior. The "s3://" scheme will cause the rest of the URI location to be taken literally; the "s3+splay://" scheme will instead enable content-addressable naming of uploads, which means the same location string can be used as a prefix for as much data as you want, and it will DTRT. (Presuming this goes well, we should consider using URI schemes in the other transports too; this would also let us drop the current "type" field assertions and have transports check that they recognize URI schemes, which makes a lot more sense when we get to configurable/pluginable dispatches for the IO transport systems.) We've also talked about upgrading data location URIs to more complex structural data than strings can handle, but without any use cases or specific demos on the table yet, well... URIs still work well enough for jdbc after a decade or so, and we're not so different.
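The scheme-switching behavior could be sketched like so (silotype is a hypothetical name; the real dispatch lives in the IO transport code):

```go
package main

import (
	"fmt"
	"net/url"
)

// silotype maps a URI scheme to a storage behavior. With scheme-based
// dispatch like this, the "type" field assertions could eventually be
// dropped: each transport just declares which schemes it recognizes.
func silotype(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "s3":
		return "s3: location taken literally", nil
	case "s3+splay":
		return "s3: content-addressed under prefix", nil
	default:
		return "", fmt.Errorf("unrecognized scheme %q", u.Scheme)
	}
}

func main() {
	for _, raw := range []string{"s3://bucket/exact/path", "s3+splay://bucket/prefix/"} {
		kind, err := silotype(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(kind)
	}
}
```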
Setting up S3/AWS permissions models is not handled here. Those are arcana that should be managed by your organization, not repeatr.