Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: correct highWaterMark default issue for large sftp downloads. #42

Closed
wants to merge 1 commit into from

Conversation

paulroth3d
Copy link

High level, this fix addresses an issue where when downloading a large file
(greater than 65535 bytes), the 65536th byte is dropped on each buffer)
Meaning zips - etc cannot be opened.

The highest number should be 65535 not 65536.
The reason is that we are dealing with unsigned digits
and 65535 in binary is 0b1111111111111111
(this is the full integer with all 1s)

65536 in binary is just one more 0b10000000000000000

When downloading large files, what happens is that the buffer
THINKs it stores 65536 bytes, but actually only stores 65535
(and the 65536th byte is lost for each buffer)

Correcting the issue to 64*1024-1 (65535) makes sure that the
highest number of bytes actually meets what it can store.

The highest number should be 65535 not 65536.
The reason is that we are dealing with unsigned digits
and 65535 in binary is 0b1111111111111111
(this is the full integer with all 1s)

65536 in binary is just one more 0b10000000000000000

When downloading large files, what happens is that the buffer
THINKs it stores 65536 bytes, but actually only stores 65535
(and the 65536th byte is lost for each buffer)

Correcting the issue to 64*1024-1 (65535) makes sure that the
highest number of bytes actually meets what it can store.
@mscdex
Copy link
Owner

mscdex commented Jul 14, 2016

I'm skeptical. Can you provide a way to reproduce the problem?

@paulroth3d
Copy link
Author

thanks @mscdex - I'm afraid I'm not sure what would be the simplest way to give a clear cut way to reproduce the issue. Although I've tried my best below unless you have any better suggestions.

The way we've been reproducing the issue, and verifying the fix:

  • find / generate a zip file that is more than 2 Mb
  • Place the zip up on a server accessible through ssh/sftp to download.
  • convert the generated zip file to hex (to help verify the bytes)
    • in linux: xxd -p test_bytes.zip | tr -d '\n' > orig_bytes_hex.zip
  • download the zip file through the stream with a highWaterMark of 64 * 1024.
  • convert the retrieved zip file to hex
    • in linux: xxd -p test_bytes.zip | tr -d '\n' > orig_bytes_hex.zip
      perform a text compare / diff compare between the two files.

See this guy for example results - http://imgur.com/a/UGisA

(If you'd like to compare the first parts of either of the files, just use:
head -c NumBytes FileToCheck.zip)

Here is an example text for line 2184 - 2187 of the correct text
17372b1443db27bfb4a6ebb6c5d016036d8ba14d76d8a232cd7d372957d3
9fdbdef49081b6e4d0a61cda9e5fd4d643fb0740c6178166c1b6d326076d
b68cb65b469b1cb4a588b614d11e2da036e73629689382b603b484d29adb

Here is the same lines for a file retrieved from the download
17372b1443db27bfb4a6ebb6c5d016036d8ba14d76d8a232cd7d372957d3
9fdbdef49081b6e4d0a61cda9e5fd443fb0740c6178166c1b6d326076db6
8cb65b469b1cb4a588b614d11e2da036e73629689382b603b484d29adbe6

The simplest way to test this, without putting in a large set of code is through the
https://www.npmjs.com/package/scp2
which references
https://www.npmjs.com/package/ssh2
which in turn references this module
https://www.npmjs.com/package/ssh2-streams

Let me know if you need a more specific set of code
and I'll try to provide it (but it might take a bit without any help)

"use strict";
var Q = require('q');
var config = require('config');
var client = require('scp2');

var credentials = { host: 'ftp.example.com', username: 'lordDarkHelmet', password: '1234' };
fileFrom = '/path/to/file.zip';
fileTo = './file.zip';

function downloadFile(credentials, fileFrom, fileTo) {
    var deferred = Q.defer();
    var request = credentials;
    request.path = fileFrom;
    client.scp(request, fileTo, function (err) {
        if (err) {
            deferred.reject(err);
        }
        else {
            //-- always succeeds, but the file is bad.
            deferred.resolve();
        }
    });
    return deferred.promise;
}
downloadFile( credentials, fileFrom, fileTo );

comparison on file:
See this guy for example results - http://imgur.com/a/UGisA

The default used for the highWaterMark from 65536 to 65535 resolves the issue
because once the buffer doesn't overflow.

@mscdex
Copy link
Owner

mscdex commented Jul 14, 2016

Can you give an exact file size that is causing you issues? Does this happen every time?

@paulroth3d
Copy link
Author

Yes this was happening every time.
The test file we were having the issue with was
27,103,645 bytes (27.1 MB on disk)

@mscdex
Copy link
Owner

mscdex commented Jul 28, 2016

I just tested with a randomly generated file with exactly that size. Transferring the file directly with ssh2 and using sftp.createReadStream() resulted in matching files every time that I tried. I even tried forcing the internal buffer to fill up to the high water mark and simulating a slow link, but the files still match.

@mscdex
Copy link
Owner

mscdex commented Jul 31, 2016

It'd also be interesting to see if you have the same problem when using something like sftp.fastGet() instead of sftp.createReadStream().

@BorePlusPlus
Copy link

BorePlusPlus commented Sep 23, 2016

As mentioned in the issue related to this PR I experienced the same. Here is code that I used and some logs of behaviour before and after applying the highWaterMark "fix".
To produce logs I added a cheeky console.log at this location.

It happened every time for me for files of various sizes.

The client OS is Linux (ubuntu 14.04 3.16.0-77) running node 4.4.0, not sure what is running on server (not one of our boxes).

@mscdex
Copy link
Owner

mscdex commented Sep 24, 2016

I'm still not able to reproduce the problem, here is what I get with the same log statement placement (and same sized file):

Ready!
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 65536 thisPool.length: 65536
bytesRead: 58831 thisPool.length: 65536
bytesRead: 0 thisPool.length: 65536

@BorePlusPlus If you set debug: console.log in your connection config, what server ident is reported near the beginning of the output?

@BorePlusPlus
Copy link

This is the reported ident:

DEBUG: Remote ident: 'SSH-2.0-1.82_sshlib GlobalSCAPE'

@mscdex
Copy link
Owner

mscdex commented Sep 26, 2016

@BorePlusPlus Looks like some commercial SFTP server that does not have a readily downloadable trial/demo version. Am I correct in assuming that 32 * 1024 also doesn't work for highWaterMark? Either way it definitely seems like a bug with the remote SFTP server implementation.

@paulroth3d Do you happen to know what the remote server ident was in your case?

@BorePlusPlus
Copy link

BorePlusPlus commented Sep 26, 2016

It all works tip-top as long as highWaterMark is lower than 64 * 1024 as per logs.

@mscdex
Copy link
Owner

mscdex commented Sep 26, 2016

Yeah I'm not sure there's much that can be done by this module to avoid that since it's ultimately a server-side issue. We could check the server ident, but I don't really feel comfortable magically changing the highWaterMark out from under the end user if they end up specifying 64 * 1024.

If we default to 64 * 1024 - 1 instead, then who's to say some other server implementation somewhere else would have a problem with that kind of value? Compatibility is tricky and I think at some point a line has to be drawn.

@evansnicholas
Copy link

I also ran into this issue while downloading large XML files. The dropped bytes meant that the downloaded file could no longer be parsed as valid XML. Changing the highWaterMark value fixed the issue.

@mscdex
Copy link
Owner

mscdex commented Oct 10, 2016

@evansnicholas Out of curiosity, what is the remote server's ident if you enable logging?

@evansnicholas
Copy link

This is the remote ident:

DEBUG: Remote ident: 'SSH-2.0-billsSSH_3.6.3q3<CR><LF> COMPANY_NAME'

@mscdex
Copy link
Owner

mscdex commented Oct 10, 2016

@evansnicholas Well that's interesting .... that's the example ident used in the SSH Transport RFC :-)

@cherchyk
Copy link

Any updates on it? We had an issues loading big files and correction proposed by @paulroth3d fixed the problem.

@mscdex
Copy link
Owner

mscdex commented Dec 22, 2016

@cherchyk Are you using ssh2-streams v0.1.16? If so, what is the server ident?

@cherchyk
Copy link

I use 0.0.23

but problem exists in current version too. Please review paulroth3d's explanation.

@mscdex
Copy link
Owner

mscdex commented Dec 26, 2016

Unfortunately until I can reproduce the issue myself, there's not much I can do to pinpoint the actual problem.

@cherchyk
Copy link

cherchyk commented Dec 26, 2016 via email

@BorePlusPlus
Copy link

@cherchyk Workaround for servers having issue with 64k high watermark is to just use a different (lower value) e.g.:

var stream = sftp.createReadStream('directory/file.zip', { highWaterMark: 32 * 1024 });

as per

https://gist.github.com/BorePlusPlus/c9186c8feb9902f84da3e35f313a43a8#file-sftp-download-js-L18

@cherchyk
Copy link

cherchyk commented Jan 26, 2017

Thanks @BorePlusPlus ,

why this is not default setting?
in most cases we don't know how big is the file. 5 or 500mb

@BorePlusPlus
Copy link

BorePlusPlus commented Jan 30, 2017

@cherchyk Well, from conversation above, you can read that this is not the default setting as it is a bit arbitrary. There may as well be servers out there where 32k high water mark wouldn't be a right setting. Essentially it is a server side issue.
On the other hand, it does seem like lowering the high water mark usually solves the issue, so it just might be pragmatic to have lower default, but this is not my call. I am just a user.

Regarding file size. If your server works with lower setting, then it doesn't matter how big your files are.

@mscdex
Copy link
Owner

mscdex commented May 28, 2017

Closing this for now. If someone is able to create a setup that I can use to easily reproduce the problem locally (e.g. docker setup or otherwise), then I am willing to give that a shot. Otherwise there is not much that I can do at this point. The highWaterMark is already configurable if you want to override the default.

@mscdex mscdex closed this May 28, 2017
@orhankutlu
Copy link

@mscdex i am having the same issue.

@BorePlusPlus i tried to set highWaterMark to 32 * 1024, still the same. last bit is always dropped.. I am really stuck. Do you have any suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants