
[Feature] Consider verifying checksum of Openstack Storage upload #236

Closed (wants to merge 1 commit)

Conversation

@rossj (Contributor) commented Jan 16, 2014

Currently, it is up to the user to ensure data integrity by sending an ETag header during an upload or checking it in the response. Is there any interest in moving this checking functionality into pkgcloud? Using a through stream, we could calculate the checksum during the upload and compare it at the end.

Questions I still have are:

  1. What is the overhead of calculating the checksum? I would guess it is very small, but I don't actually know.
  2. What happens if the checksum match fails? Probably remove the file and return an error, but what if the remove operation itself fails?

@kenperkins (Member)

Looks like we can't hash a file without storing it in memory or on disk. Which means we'll have to write some code to create a temp file store, and that has problems as well. Not sure I'm in love with that. In memory is a non-starter as most users are running node on limited memory VMs.

The hash itself should be pretty straightforward if we have the file on disk or in memory.

@rossj (Contributor, Author) commented Jan 14, 2014

I believe we could calculate the hash without temporarily storing the file using something like jeffbski/digest-stream, which calculates the hash of a stream. My understanding of MD5 is that it only needs to buffer 512 bits (64 bytes) of data before it can update its state, discard the data, and wait for the next chunk.

@kenperkins (Member)

@jcrugzz (Contributor) commented Jan 14, 2014

@kenperkins It flushes the buffer on each digest when the end function is called:

https://github.com/jeffbski/pass-stream/blob/master/lib/pass-stream.js#L46-L49

@kenperkins (Member)

If so, let's write up a prototype/branch :) @rossj, are you on it?

@rossj (Contributor, Author) commented Jan 14, 2014

@kenperkins I'm pretty sure that .push() is just passing the file data on to whatever the next stream is. http://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding

_flush() should only be called at the end, when the input stream dies, at which point digest-stream will give us the final hash.

I'll set up a branch and try some things out.

rossj added a commit to rossj/pkgcloud that referenced this pull request Jan 16, 2014

var container = options.container,
success = callback ? onUpload : null,
@kenperkins (Member):

Perhaps @indexzero should have put a comment here, but this was very deliberate. If you pass a callback into mikeal/request, it will buffer the entire contents of the stream into memory.

That's why we had the code setup to check that if there was no callback provided by the caller, don't pass one down the chain.

Contributor:

You are correct here @kenperkins. @rossj we want to preserve that check.

@rossj (Contributor, Author):

@kenperkins @jcrugzz I'm pretty confident that mikeal/request does not buffer outgoing data, even if a callback is provided. While #195 mentions both upload and download, the two request issues it references (request/request#639 and request/request#477) only refer to download buffering being a problem.

Case in point: I uploaded a 2.8 GB file without issue, and the process's memory usage hovered between 90 and 160 MB the whole time (comparable to the current 0.9.0 branch).

Currently the code always passes a callback down the chain to request so that it can do the hash check on the response. If this is still a concern, I can probably do the check in request.on('response') instead.
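For the request.on('response') route, the comparison itself is simple; the sketch below uses a hypothetical `etagMatches` helper. Swift/CloudFiles return the MD5 hex digest of the object as the ETag, and some servers quote the header, so it is normalized before comparing:

```javascript
// Hypothetical helper: compare a locally computed MD5 hex digest
// against the ETag header the server returned.
function etagMatches(localMd5Hex, etagHeader) {
  if (!etagHeader) return false;
  // strip surrounding quotes, if any, and compare case-insensitively
  var etag = String(etagHeader).replace(/^"|"$/g, '').toLowerCase();
  return etag === localMd5Hex.toLowerCase();
}

// Illustrative wiring (not pkgcloud code): listen for the response
// event instead of passing a callback down to request:
//
//   var req = request(uploadOptions);
//   sourceStream.pipe(hashingStream).pipe(req);
//   req.on('response', function (res) {
//     if (!etagMatches(computedMd5, res.headers.etag)) {
//       // mismatch: surface an error (and possibly remove the file)
//     }
//   });
```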

Member:

@rossj is correct. Buffering only occurs for downloads. See: https://github.com/mikeal/request/blob/master/request.js#L843-L890

@kenperkins (Member)

Thanks for the follow-up @rossj. So it looks like we can pass a callback every time in upload.

@kenperkins (Member)

I finally had a chance to get this branch evaluated on my local machine. Props for getting tests running. I have a few comments that I'll inline.


// The returned checksum does not match what was calculated on the client side; remove the file
self.removeFile(container, options.remote, function (err) {
  if (err) {
    // the cleanup itself failed; surface that error instead
    return callback(err);
  }
  callback(new Error('Checksum does not match the local calculation'));
});
Member:

This feels brittle to me. Obviously our service should be highly available, but the upload-then-check-then-delete flow feels inverted.

I guess there's not much we can do if you want to not have to buffer the entire file locally to generate the hash as part of the inbound request.

@kenperkins (Member)

@rossj I think what we'd like to do is close this PR, hoping instead to leverage this code as an example on top of pkgcloud. There are a couple of reasons for it.

  1. There is an inherent brittleness here that makes this potentially troublesome for an end user: for example, if there is a partition after the upload but before we validate the MD5.
  2. It bakes in complexity that is better off being transparent to the end user. I think your idea was great: take advantage of the MD5 capabilities to verify an upload. The problem is that we can't do this as part of the upload in a streaming scenario, and because of that, it results in some odd post-upload behavior.

Would you be willing to re-work this code into an example that we could include in the openstack storage examples?

@kenperkins closed this Mar 18, 2014
@rossj (Contributor, Author) commented Mar 25, 2014

Hey @kenperkins, I think this makes sense. There are some definite edge cases where there isn't a clear path forward, and these cases are probably best left to the user's implementation.

What exactly do you mean by "partition after upload" in your first point? Would CloudFiles potentially move or split the file after upload?

I'll take a look at the examples and try to work this in.
