Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Fix fwrite() on non-blocking SSL sockets #2330

Merged
merged 8 commits into from
Mar 14, 2017
Merged

Conversation

bukka
Copy link
Member

@bukka bukka commented Jan 22, 2017

This is still work in progress but want to see what Travis thinks about that.

@bwoebi this should hopefully address https://github.com/amphp/aerys/issues/107 so if you could give a try that would be great.

@weltling I haven't setup my win machine yet so if you have time and can test it on windows that would be awesome. I have got a separate server and client tests in here (you run server first and then client and see what it does ;)) which is a bit more user friendly for manual testing:

https://github.com/bukka/php-util/tree/master/tests/openssl/asyncw

@bwoebi
Copy link
Member

bwoebi commented Jan 22, 2017

I'd love to try it, but I couldn't reproduce the issue back then with aerys locally - so I probably am not a big help here.

@krakjoe krakjoe added the Bug label Jan 23, 2017
@krakjoe
Copy link
Member

krakjoe commented Jan 23, 2017

@bukka @bwoebi is there any one else that could/can reproduce the issue, or provide some kind of feedback here ?

@bukka
Copy link
Member Author

bukka commented Jan 24, 2017

I need to double check few things and do a bit more testing to be more sure that it doesn't break anything. It would be also nice to have some Win testing. Btw. anyone should be able to reproduce with the supplied test. But it's still WIP as more testing required. I will remove WIP from the title and merge it once I'm more sure that it's all fine.

@Fedott
Copy link

Fedott commented Jan 28, 2017

I checked this fix on my case. It's work!

@staabm
Copy link
Contributor

staabm commented Jan 29, 2017

I checked this fix on my case. It's work!

Whats your case? Any details? What/How did you test this patch?

@Fedott
Copy link

Fedott commented Jan 29, 2017

Whats your case? Any details? What/How did you test this patch?
Show all checks

I have problem similar described in issue https://github.com/amphp/aerys/issues/107.
I compiled php in docker container and run own application.
I loaded big files via protocol http2 (used openssl). Files loaded correctly.
Communication over websocket with openssl work too.

@kelunik
Copy link
Member

kelunik commented Feb 24, 2017

@bukka What's the state here?

@trowski
Copy link
Member

trowski commented Feb 24, 2017

@DaveRandom and @brzuchal discovered this same problem last night, and after working with them we also realized that SSL_MODE_ENABLE_PARTIAL_WRITE and SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER were needed to make non-blocking stream writes succeed when given a large buffer. We had no idea this PR existed, and I already committed a patch with essentially the same fix, however with the modes set in php_openssl_setup_crypto. (Setting the modes separately is a leftover from trying to determine if both options were needed, definitely should be one statement).

I apologize for stepping on anyone's toes. How should we proceed? @bukka Feel free to replace what I committed with a single statement or move it to php_openssl_enable_crypto if that's where the options should be set.

@bukka
Copy link
Member Author

bukka commented Feb 24, 2017

I want to look on the whole thing a little bit later in March and think about a better implementation in general.

About this one. I have tested it with non blocking on Linux which seems fine however haven't done much testing with blocking to be sure it has no influance (meanig it really won't get set). In general these modes could possibly cause some BC issues so we should be careful. That's why I set it just for non blocking where it doesn't work already.

@trowski have you done some testing with blocking - I haven't checked your tests properly so not sure if it has any influance but code seems to set it always. Also have you been testing that on Win to be sure that it works fine in there as well? I think it's always better to test it properly rather than break some apps... We should be really sure about all possible consequances before we ship it though... If you are not 100% sure, might be bettet to revert it and wait a little bit once I have got a bit more time in March.

@trowski
Copy link
Member

trowski commented Feb 24, 2017

@bukka I tried the same test with a blocking stream and didn't have any issues. However, I agree that the fix should only be applied to non-blocking streams. Since the issue only exists with non-blocking streams there's no reason to inadvertently change behavior for blocking streams.

I think the chances of BC breaks applying the fix to only non-blocking streams is minimal. Behavior below the buffer size causing issues continues to be the same and of course the patch fixes the large buffer issue.

I have not personally tested on Windows, though the tests on appveyor seem to pass.

I'd like to apply the fix to only non-blocking streams and ship this in the next point releases. This is an issue we're having now and is preventing usage of non-blocking streams for messages over about 128K.

@bukka
Copy link
Member Author

bukka commented Feb 25, 2017

@trowski You can leave it. I will rewrite it with my version when I'm back as it's cleaner and just for non-blocking. I realised that the next release should be tagged in the week starting on 13th March so I have enough time to look into it.

Btw. we really need to stop adding test for TLS 1.2 only as it's not good for future testing (e.g. once the 1.3 support is released) and there should be no issue with lower versions anyway for this particular. I actually think that our setting of a single constant is just something that we should deprecate and rather add min and max versions. But that's a completely different issue of course.

Anyway don't worry about anything. I will take a look in March and will then close this PR as well ;)

@nikic
Copy link
Member

nikic commented Mar 14, 2017

Please squash PRs with dirty history before merging.

@bukka
Copy link
Member Author

bukka commented Mar 14, 2017

It's not a dirty history and it's rebased which is actually what our official flow is:

https://wiki.php.net/vcs/gitworkflow

@kelunik
Copy link
Member

kelunik commented Mar 14, 2017

Will this be in the release this week or the next release?

@nikic
Copy link
Member

nikic commented Mar 14, 2017

By dirty history I mean that it contains multiple commits performing "fixup" of previous commits, e.g. to fix indentation, slightly change style etc. If they are not (interactively) rebased or squashed before merging, these individual commits show in blames etc. When blaming code I want to see "Fixed bug #XXX" and not click through a series of "fixed indent" and "tidy code".

@bukka
Copy link
Member Author

bukka commented Mar 14, 2017

@kelunik It should be in the next tagged release which should be 7.0.18 and 7.1.4 (at least that's where I updated NEWS)

@nikic well not all of them are for the same things. For example the indent fix and skip is for another test that was already there. Also there is a clean up of the mode setting which is not related. Considering that squashing is not really our requirement by the linked workflow, I don't see any reason for it in here (read I just can't be bothered to squash only some bits... :) ).

@bukka
Copy link
Member Author

bukka commented Mar 14, 2017

@nikic Also to see all changes you can easily view the merge commit which gives you exactly that so I don't really understand what the advantage of squashing is...

@nikic
Copy link
Member

nikic commented Mar 14, 2017

@bukka In this case the merge commit will not show up in (non-global) commit logs and blames (trivial merge).

I'm not saying that everything should be squashed together, but the history should be meaningful. It's fine to keep the reindent of an existing file as a separate commit, but having multiple separate commits for small tweaks to code that was introduced in the same PR is not meaningful.

@weltling
Copy link
Contributor

@bukka could you please check the AppVeyor fails on ext\openssl\tests\bug72333.phpt?

Thanks.

@bukka
Copy link
Member Author

bukka commented Mar 15, 2017

@weltling Looks like win in AppVeyor kills the socket quite quickly so this sort of test won't probably work there. Might be probably better to skip it on win or in just AppVeyour if it works on your local win machine. If you have a little bit of time to run the test locally on Win and possibly double check the functionality (if the fix works) - if you are able to give a try of the piece of code in https://bugs.php.net/bug.php?id=72333 that would be awesome.

I will hopefully have time to finally properly setup my PHP Win env in the next few weeks but definitely not this week so your help would be appreciated!

@weltling
Copy link
Contributor

@bukka i was previously checking the test locally, same fails there about "connection was forcibly closed". Timely it looks busy as well these days, but willing to help or support the fix testing, ofc. Just let the code from the BT run, here's what it gives out:

int(0)
string(0) ""
int(65536)
int(0)

Not sure it's right, as the second write gives 0 back. Clear, it'd be better to know exactly, what goes on, before adding skips.

BTW you can also signup to AppVeyor locally using the GitHub account. Then it'll pickup pushes to your fork.

Thanks.

@nikic
Copy link
Member

nikic commented Mar 15, 2017

Looks like the test is also intermittent on Travis, e.g. failing in https://travis-ci.org/php/php-src/jobs/211471345

========DIFF========
001+ Warning: fwrite(): SSL: Connection reset by peer in /home/travis/build/php/php-src/ext/openssl/tests/ServerClientTestCase.inc(96) : eval()'d code on line 11
========DONE========
FAIL Bug #72333: fwrite() on non-blocking SSL sockets doesn't work [ext/openssl/tests/bug72333.phpt] 

Of course, depending on frequency this may be okay, server tests tend to be unreliable.

@bukka
Copy link
Member Author

bukka commented Mar 16, 2017

@weltling @nikic Should be sorted by 0c8ad36

@bukka
Copy link
Member Author

bukka commented Mar 16, 2017

@weltling That result for the BT is expected. 0 is there because socket is not available for read. The main thing is that there is no error ;)

@weltling
Copy link
Contributor

weltling commented Mar 17, 2017

@bukka Thanks for the patch. When i run the test locally on Windows, the test blocks now. Seems there's still some test issue, but might be a local issue, too. From the test code, and what you've expressed in the last comment, shouldn't we also break when (edit) false === $result? From the test - the client would've anyway written less that server wanted to read, so it is likely the client loops through select endlessly, as the server didn't close the socket. I'll be able to check more close on monday, probably. Still, what was intended to be checked, from the current test the actual size the client wanted to write doesn't matter.

Thanks.

@weltling
Copy link
Contributor

A quick try yet - decreasing the select timeout to 1 second makes the test to pass. Seems the client indeed hangs, but not in the loop - in the select. Still, one second probably too slow for the test run :/

Thanks.

@bukka
Copy link
Member Author

bukka commented Mar 17, 2017

@weltling I have committed another fix 53e2c91 that should consume data by server so the client should be able to write more hopefully. AppVeyor seems fine but your local setup might have a smaller queue so thought that it might be the issue. Let me know once you get time to try it. ;) Thanks

@weltling
Copy link
Contributor

@bukka thanks for taking care. Seems this is a tricky case :/ I've still have no stable result locally with ext\openssl\tests\bug72333.phpt. Now was checking with 32- and 64-bit and OpenSSL 1.1 on master, whereby it shouldn't matter with the exact versions. Locally, i still have the test hanging sporadically, but running through some times with a warning "completed successfully", sometimes passing. I'm going to observe it more the comming week. What were the exact queue setting you mentioned?

Thanks.

@weltling
Copy link
Contributor

@bukka, thinking more - are you sure this high timeout on select is needed? OFC it is hard to check what exactly happens on CI, but i guess there's a good potential it hangs elsewhere, not only on Windows, even then it passes afterwards. For now it looks like some dead lock in the network related test code. I guess the big timeout is the issue, will experiment on that yet.

Thanks.

@bukka
Copy link
Member Author

bukka commented Mar 20, 2017

@weltling I lowered the timeout to 1 sec as it should be the top when this happens anyway and if it goes over 1 sec, than it would hang for longer. It is kind of strange that it's not visible in AppVeyor but you get it.

Just to explain what the test is about. The point is to make write in the client (used by OpenSSL) write less than requested (so the returned size is lower than the supplied data size). It is more difficult to create such case on localhost because OS usually caches (at least linux does) it in the network buffer (the queue that I was talking about) so there must be enough data supplied by the client that are not consumed by the server (that happens when there is a usleep after short read). In your case I guess that the client is not capable to continue to write anything even though there should be enough space in the queue. Not really sure about the exact reason though but currently seems to me like something Windows related as it does not happen on Linux but might be wrong.

@weltling
Copy link
Contributor

@bukka this looks more or less stable now, tested quick with the current 7.0 and master without rebuild. Probably it'll be even better do have a smaller select timeout. It looks like a bit wrong assumption in the test code, in the current and in the previous version as well, that leads to the dead lock. The principal fact is, that select blocks, disregarding whether the socket itself is non blocking.

Thanks for the explanation, likely the behavior to enforce is indeed a tricky thing, as the I/O buffers are controlled by the systems. It is also likely, that AppVeyor was already doing same - hanging for the full timeout and then running the test through. That's what i meant by telling it's hard to say what happens on CI. As the only checked part is that the operation is finished without warnings, the test could still pass. The previous test version was actually doing same locally - let it run, it exhausts the timeout and the test runs through. I couldn't say whether this was same in the very first test version, probably it were just hanging too long so i didn't wait for the timeout and thought it just hangs. The diff to your new version is anyway that i don't see any extra occasional warnings anymore, teh "operation completed successfully" warnings i was mentioning above. We still could decrease timeout, but in general the test should be fine now.

Many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants