SFTP uploads randomly fail #1851

Closed
daanhaitsma opened this issue Oct 5, 2022 · 16 comments

Comments

@daanhaitsma

daanhaitsma commented Oct 5, 2022

We're using phpseclib in Laravel through the thephpleague/flysystem-sftp-v3 package, and after updating we have encountered a lot of errors. We previously opened an issue at Flysystem as well.

Quick overview of the versions:

|           | Before | After  |
|-----------|--------|--------|
| phpseclib | 2.0.32 | 3.0.15 |
| flysystem | 1.0.22 | 3.5.2  |

From the flysystem package, we are randomly getting an UnableToCreateDirectory exception, even though the directory has existed for years. After storing the logs from phpseclib, we believe we've found some interesting leads.

  • A successful upload sends NET_SSH2_MSG_CHANNEL_OPEN 1 time, receives NET_SSH2_MSG_CHANNEL_OPEN_CONFIRMATION 1 time
  • A failed upload sends NET_SSH2_MSG_CHANNEL_OPEN 5 times, but receives NET_SSH2_MSG_CHANNEL_OPEN_CONFIRMATION only 4 times

These are the parts of the logs that I think are most interesting and show a clear difference:


Now our next question is: What would cause NET_SSH2_MSG_CHANNEL_OPEN to be sent twice in a row (in the beginning, before NET_SSH2_MSG_GLOBAL_REQUEST), without receiving either NET_SSH2_MSG_CHANNEL_OPEN_CONFIRMATION or NET_SSH2_MSG_CHANNEL_OPEN_FAILURE in between?

@terrafrost
Member

Could you do print_r($sftp->getErrors()) next time this happens and post what it says?

@daanhaitsma
Author

Thanks for that! We've just released a change to log the result of this method, but unfortunately it only shows one error, which was also in the SSH logs. It's the same for both successful and failed uploads:

SSH_MSG_GLOBAL_REQUEST: hostkeys-00@openssh.com

@terrafrost
Member

So I feel like this code snippet (SFTP::partial_init_sftp_connection) is prob being hit:

        $this->channel_status[self::CHANNEL] = NET_SSH2_MSG_CHANNEL_OPEN;

        $response = $this->get_channel_packet(self::CHANNEL, true);
        if ($response === true && $this->isTimeout()) {
            return false;
        }

$this->get_channel_packet should return true but I'm wondering if there's a timeout going on as well.

Like in 3.0 every public function calls precheck() first. precheck() does this:

    private function precheck()
    {
        if (!($this->bitmap & SSH2::MASK_LOGIN)) {
            return false;
        }

        if ($this->pwd === false) {
            return $this->init_sftp_connection();
        }

        return true;
    }

init_sftp_connection, in turn, calls partial_init_sftp_connection.
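To make that chain concrete, here's a toy model (hypothetical code, NOT phpseclib internals — class and method names are made up) of how each public call re-runs precheck() and, while init keeps timing out, re-sends NET_SSH2_MSG_CHANNEL_OPEN:

```php
<?php
// Toy model (NOT phpseclib code): while init keeps timing out, $pwd stays
// false, so every subsequent public call re-runs init and re-sends
// NET_SSH2_MSG_CHANNEL_OPEN.
class FakeSftp
{
    public $pwd = false;           // stays false until init completes
    public $channelOpensSent = 0;  // counts simulated CHANNEL_OPEN sends
    private $timeoutsLeft;         // how many inits will time out

    public function __construct(int $timeoutsLeft)
    {
        $this->timeoutsLeft = $timeoutsLeft;
    }

    private function precheck(): bool
    {
        return $this->pwd !== false ? true : $this->initSftpConnection();
    }

    private function initSftpConnection(): bool
    {
        $this->channelOpensSent++;      // CHANNEL_OPEN goes out...
        if ($this->timeoutsLeft-- > 0) {
            return false;               // ...but the confirmation read times out
        }
        $this->pwd = '/';               // confirmation arrives; init done
        return true;
    }

    public function put(): bool
    {
        return $this->precheck();       // every public method starts here
    }
}

$sftp = new FakeSftp(4);                // first four inits time out
$results = [];
for ($i = 0; $i < 5; $i++) {
    $results[] = $sftp->put();
}
// Five CHANNEL_OPENs were sent even though the caller only ever called put().
```

This is only meant to show why retried public calls multiply the CHANNEL_OPEN count in the log; the real timeout handling is more involved.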

So that'd explain why you're seeing multiple NET_SSH2_MSG_CHANNEL_OPENs / NET_SSH2_MSG_CHANNEL_OPEN_CONFIRMATIONs in your log.

Now, the question is... why would the timeout flag be set?

A few possibilities exist (since the log that you posted doesn't show any overwhelmingly slow packets):

  • maybe flysystem is setting the timeout to some unreasonably low amount. $ssh->setTimeout(0) means that there isn't a timeout, but if it were $ssh->setTimeout(0.1) that would explain stuff
  • maybe other SFTP sessions are being run through the same object somehow and that's affecting the timeout

Anyway, to confirm this, I guess the first thing I'd do is echo $sftp->isTimeout() ? 'yes' : 'no'; after the first time you have the issue.

Assuming that that outputs yes I'd prob modify SSH2::isTimeout() to output $this->curTimeout.

@daanhaitsma
Author

daanhaitsma commented Oct 7, 2022

Thanks for your input, this is some very valuable information!

Since this only happened after the 3.0 update, the precheck could be interesting indeed.

Regarding the two possibilities you mentioned:

maybe flysystem is setting the timeout to some unreasonably low amount. $ssh->setTimeout(0) means that there isn't a timeout but if it were $ssh->setTimeout(0.1) that would explain stuff

I trust that Flysystem correctly applies the configuration we give it, so the timeout should be 20 seconds. Given the short times in the logs, I think that should be plenty.

maybe other SFTP sessions are being run through the same object somehow and that's affecting the timeout

This is an interesting one and could definitely be the case, as the Laravel service container apparently shares its bound instances across all jobs running in the same queue through Horizon.

I'm not sure if the storage connection is actually stored in that service container as well, but I'll look into that right away, as well as logging the result of ->isTimeout(), and get back to you.
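For context, the sharing behavior works roughly like this (a minimal hypothetical container, NOT Laravel's actual implementation): a singleton binding hands every job in the same worker process the same instance, so connection state carries over between jobs:

```php
<?php
// Minimal hypothetical container (NOT Laravel's implementation) showing why
// a singleton-bound connection is shared by all jobs in one worker process.
class TinyContainer
{
    private $factories = [];
    private $instances = [];

    public function singleton(string $id, callable $factory): void
    {
        $this->factories[$id] = $factory;
    }

    public function make(string $id)
    {
        // Build once, then hand back the same instance forever after.
        return $this->instances[$id] ??= ($this->factories[$id])();
    }
}

$app = new TinyContainer();
$app->singleton('sftp', function () {
    $conn = new stdClass();
    $conn->isTimeout = false;   // stands in for phpseclib's timeout flag
    return $conn;
});

$jobA = $app->make('sftp');
$jobA->isTimeout = true;        // job A hits a timeout...

$jobB = $app->make('sftp');     // ...and job B inherits the same flagged object
```

If the SFTP adapter is bound this way, a timeout flag set during one queued job would still be set when the next job resolves the "same" connection.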

@daanhaitsma
Author

Back to say that you are right! It does say that there's a timeout whenever this fails.

@terrafrost
Member

Try https://github.com/terrafrost/phpseclib/tree/timeout-debug and lmk what you get as the output.

Here's the specific commit that's new to that branch:

terrafrost@7485f76

@daanhaitsma
Author

Got this log from a failed upload:

Is timeout: yes

Output: 
1. curTimeout - 0.0016241073608398 = 19.998375892639
2. curTimeout - 0.011014938354492 = 19.987360954285
5. curTimeout - 0.35286808013916 = 19.634492874146
5. curTimeout - 0.00028896331787109 = 19.634203910828
5. curTimeout - 2.8610229492188E-6 = 19.634201049805
5. curTimeout - 0.00011801719665527 = 19.634083032608
5. curTimeout - 0.040347099304199 = 19.593735933304

And this from a successful one, 30 seconds later:

Is timeout: no

Output: 
1. curTimeout - 0.0022671222686768 = 19.997732877731
2. curTimeout - 0.010595083236694 = 19.987137794495
5. curTimeout - 0.28265309333801 = 19.704484701157
5. curTimeout - 0.00027990341186523 = 19.704204797745
5. curTimeout - 0.0010130405426025 = 19.703191757202
5. curTimeout - 1.9073486328125E-6 = 19.703189849854
5. curTimeout - 0.00066590309143066 = 19.999334096909
5. curTimeout - 0.00030207633972168 = 19.99969792366
5. curTimeout - 0.0004429817199707 = 19.99955701828
5. curTimeout - 0.00034093856811523 = 19.999659061432
5. curTimeout - 0.00030684471130371 = 19.999693155289
5. curTimeout - 0.0004889965057373 = 19.999511003494
5. curTimeout - 0.00021600723266602 = 19.999783992767
5. curTimeout - 0.00021600723266602 = 19.999783992767
5. curTimeout - 0.00029706954956055 = 19.99970293045
5. curTimeout - 5.0067901611328E-6 = 19.99999499321
5. curTimeout - 3.814697265625E-6 = 19.999996185303
5. curTimeout - 0.00027203559875488 = 19.999727964401
5. curTimeout - 0.00042605400085449 = 19.999573945999

Other than the latter obviously having a lot more 5.'s, they look about the same to me. Hoping this tells you more! :)

@terrafrost
Member

I apologize for the delay.

Anyway, from what you posted, it doesn't seem like there should be a timeout. Maybe stream_select is acting funky.

I added another commit to that same branch with more debug code.

Thanks!

@daanhaitsma
Author

daanhaitsma commented Oct 18, 2022

Ah, it does seem like it. We are now indeed getting the stream_select error messages described in the code comment.

For example:

stream_select(): Unable to select [4]: Interrupted system call (max_fd=29)

It is now also no longer saying there's a timeout. These are the full logs again, though they still seem the same:

Failed upload

Is timeout: no

Output: 
1. curTimeout - 0.0017609596252441 = 19.998239040375
2. curTimeout - 0.011067867279053 = 19.987171173096
5. curTimeout - 0.15409898757935 = 19.833072185516
5. curTimeout - 0.00028181076049805 = 19.832790374756
5. curTimeout - 0.00094413757324219 = 19.831846237183
5. curTimeout - 9.5367431640625E-7 = 19.831845283508
5. curTimeout - 0.00060510635375977 = 19.999394893646
5. curTimeout - 0.00033903121948242 = 19.999660968781
5. curTimeout - 0.00030112266540527 = 19.999698877335
5. curTimeout - 0.00033688545227051 = 19.999663114548
5. curTimeout - 0.00011706352233887 = 19.999882936478
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 4.1007995605469E-5 = 19.999958992004
5. curTimeout - 1.1920928955078E-6 = 19.999998807907
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 5.2213668823242E-5 = 19.999947786331
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 0 = 20
5. curTimeout - 0 = 20
5. curTimeout - 3.6001205444336E-5 = 19.999963998795
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 5.0067901611328E-5 = 19.999949932098
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5129013061523E-5 = 19.999904870987
5. curTimeout - 0.00032901763916016 = 19.999670982361
5. curTimeout - 0.00024199485778809 = 19.999758005142
5. curTimeout - 0.0013229846954346 = 19.998677015305
5. curTimeout - 0.00030303001403809 = 19.999696969986

Successful upload

Is timeout: no

Output: 
1. curTimeout - 0.0018579959869385 = 19.998142004013
2. curTimeout - 0.011698007583618 = 19.986443996429
5. curTimeout - 0.16940808296204 = 19.817035913467
5. curTimeout - 0.00014710426330566 = 19.816888809204
5. curTimeout - 0.0014221668243408 = 19.81546664238
5. curTimeout - 2.1457672119141E-6 = 19.815464496613
5. curTimeout - 0.0023260116577148 = 19.997673988342
5. curTimeout - 0.0017449855804443 = 19.99825501442
5. curTimeout - 0.00024509429931641 = 19.999754905701
5. curTimeout - 0.0007469654083252 = 19.999253034592
5. curTimeout - 5.9604644775391E-6 = 19.999994039536
5. curTimeout - 1.9073486328125E-6 = 19.999998092651
5. curTimeout - 1.9073486328125E-6 = 19.999998092651
5. curTimeout - 2.1457672119141E-6 = 19.999997854233
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 6.9141387939453E-6 = 19.999993085861
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 0.00064420700073242 = 19.999355792999
5. curTimeout - 0.00024199485778809 = 19.999758005142
5. curTimeout - 0.00079011917114258 = 19.999209880829
5. curTimeout - 0.00018596649169922 = 19.999814033508
5. curTimeout - 0.00041794776916504 = 19.999582052231
5. curTimeout - 0.00061702728271484 = 19.999382972717
5. curTimeout - 5.9604644775391E-6 = 19.999994039536
5. curTimeout - 1.9073486328125E-6 = 19.999998092651
5. curTimeout - 9.5367431640625E-7 = 19.999999046326
5. curTimeout - 0.00010013580322266 = 19.999899864197
5. curTimeout - 0.00017094612121582 = 19.999829053879
5. curTimeout - 0.00067806243896484 = 19.999321937561

Thanks again!

@mozex

mozex commented May 8, 2023

Hi @daanhaitsma, the same issue is happening to me. Could you let me know if you found a solution to this problem?

@terrafrost
Member

@mozex - did you try the same debugging branch that the original poster tried?:

https://github.com/terrafrost/phpseclib/tree/timeout-debug

If not then if you could try it and lmk what the output is that'd be great.

If you are getting the same error as the OP was (stream_select(): Unable to select [4]: Interrupted system call (max_fd=29)) then that would suggest that you're getting a bunch of interrupted system calls. It'd be interesting to know what's causing all those interruptions. I could look into implementing a workaround as well, but (1) the timeout functionality prob wouldn't work, (2) all the interrupted system calls could have other hard-to-predict consequences, and (3) as I've never been able to reproduce this problem, it'd be hard for me to actually test anything. Because of (3), SSH access to a server that reproduces the problem would help (if the issue is indeed with stream_select, as it was for the OP).

@terrafrost
Member

terrafrost commented May 9, 2023

Assuming the issue is due to stream_select then here's where that error is being generated:

	retval = php_select(max_fd+1, &rfds, &wfds, &efds, tv_p);

	if (retval == -1) {
		php_error_docref(NULL, E_WARNING, "Unable to select [%d]: %s (max_fd=%d)",
				errno, strerror(errno), max_fd);
		RETURN_FALSE;
	}

https://github.com/php/php-src/blob/ab3f584a2575e621a95f458d0f5b99039894924d/ext/standard/streamsfuncs.c#L858-L864

So 4 corresponds to errno which, in turn, I guess corresponds to EINTR. Quoting https://stackoverflow.com/a/14262151/569976:

The general approach to EINTR is to simply handle the error and retry the operation again

One thing that's unclear to me is if the time elapsed will still be updated. Also, how frequently are you getting these interrupts?
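That retry approach could be sketched like this (a generic sketch, NOT phpseclib's actual code): retry stream_select() when it fails with EINTR, recomputing the remaining time from a fixed deadline so elapsed time is still accounted for across retries:

```php
<?php
// Generic sketch (NOT phpseclib's actual code): retry stream_select() on
// EINTR, shrinking the remaining timeout so the overall deadline holds.
function select_with_retry(array $read, float $timeout)
{
    $deadline = microtime(true) + $timeout;
    while (true) {
        $r = $read;   // stream_select() modifies its arguments, so copy
        $w = null;
        $e = null;
        $remaining = max(0, $deadline - microtime(true));
        $sec = (int) $remaining;
        $usec = (int) (($remaining - $sec) * 1000000);

        error_clear_last();
        $result = @stream_select($r, $w, $e, $sec, $usec);
        if ($result !== false) {
            return $result;  // 0 = timed out, >0 = number of ready streams
        }

        $err = error_get_last();
        if ($err === null || strpos($err['message'], 'Interrupted system call') === false) {
            return false;    // a real failure, not EINTR
        }
        // EINTR: loop around and retry with the reduced remaining timeout.
    }
}
```

With a stream that already has pending data this returns 1 immediately; with an idle stream it returns 0 once the deadline passes, no matter how many interrupts occurred in between.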

Either way, it'd be cool to have access to a system that reproduces this. And I don't mean SFTP access - I mean SSH2 access. Like I need to be able to run phpseclib from that machine.

Thanks!

@terrafrost
Member

Try this branch and see if it helps:

https://github.com/terrafrost/phpseclib/tree/3.0-stream-select-interrupt

@mozex

mozex commented May 17, 2023

Thank you for your quick response! I've been trying to reproduce the error since then, but it seems to have fixed itself, and I don't know why. I haven't done any debugging on it and don't have any more information. I waited a couple of days for it to reappear, but no luck.

@MrVerifit

Hi @terrafrost, hope you are doing well. One of my clients also faced this issue. We tried the latest versions of V2 and V3 of phpseclib. V2 fails without reporting any error, while V3 reports this error: stream_select(): Unable to select [4]: Interrupted system call (max_fd=29)

Then I applied the fix from this branch: https://github.com/terrafrost/phpseclib/tree/3.0-stream-select-interrupt to both V2 and V3 and tested those versions on the client's site. Both versions with the fix worked!

@terrafrost
Member

Thanks for that! I went ahead and merged it:

5a7569c
