modlwip.c does not properly return POLL_HUP and POLL_ERR socket errors #5172
@peterhinch - Kudos to you for finding this issue way back in #4290 and documenting it better than I did here. I'm very certain that it's the very same issue, and it is fixed for STM32s with the mods made to the related modules. I have not dug into the ESP code base to understand the related behaviors. However, I'm not sure I agree with the expected error returns. Reading the related Linux man page, it is pretty clear that revents shall NOT simply return the caller's event flags. So we shouldn't be getting POLL_IN + POLL_HUP (17) returned. If we are, then the ESP stacks are also broken in a non-compliant way. Of course we can deal with that in the upper layers, but it isn't a good direction to be moving in. One thing to try with the ESPs is simply adding the POLL_HUP | POLL_ERR flags to the uasyncio poll read and write calls and seeing how that changes your results. This is a socket ioctl API issue that may be impacting multiple targets. However, the lack of error handling in uasyncio's start_server loop seems to be a global issue. Yes, one can trap that error higher up, but crashing and restarting a server on every connection dropped before accept is not OK. As soon as I have a break here, I will clean out all the excess debug notes in the related files and post a complete working example which is, AFAIK, tested and ready for merge consideration. |
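A quick way to see the POSIX behavior described above on a desktop: CPython's select.poll on Linux reports POLLHUP unsolicited, even when only POLLIN was registered. A minimal sketch (a pipe stands in for the socket peer; this is an illustration of the semantics, not code from the thread):

```python
import os
import select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)   # note: POLLHUP/POLLERR NOT requested

os.close(w)                    # peer "hangs up"
fd, revents = p.poll(1000)[0]  # [(fd, revents)]

# POLLHUP comes back unsolicited, as the man page requires
print(bool(revents & select.POLLHUP))   # True on Linux
os.close(r)
```

This is exactly the behavior the pre-fix modlwip.c ioctl fails to provide: it masks HUP/ERR out of revents unless they were requested.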
As a very general comment, the testing @kevinkk525 and I performed was to determine the response to dropped connections and, critically, to WiFi dropouts. The latter are commonplace. Any updates should be tested against these conditions in addition to testing against malicious peers. It's possible that uasyncio.start_server is only used by Picoweb. We avoid it because of the fragility mentioned above. |
After a few tweaks there is nothing wrong with uasyncio.start_server; it is no longer fragile. Here are some simple testing tools crafted to generate abandoned connections, reset connections, etc. The sock_test.py script should be launched as multiple subprocesses. This will generate a flood of overlapping reset HTTP connections. With randomized delays and multiple generators, all the code race conditions are eventually hit. To simulate WiFi dropped connections, the scapy library is used with its lower-level packet manipulations. The command-line launch info is commented in the scapy-cAbandon.py script. One of the options there will provide a simple SYN flood attack, but I suggest using the -v True mode for our purposes here. |
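For reference, a minimal sketch of the reset-generator idea (illustrative only; reset_get and its parameters are my naming, not the actual sock_test.py): SO_LINGER with a zero timeout makes close() abort the connection with a TCP RST instead of a normal FIN close.

```python
import random
import socket
import struct
import time

def reset_get(host, port):
    # Open a connection, send an HTTP request, then abort it with RST.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, port))
    s.sendall(b"GET / HTTP/1.0\r\n\r\n")
    # Randomized delay so repeated runs land on different server states
    time.sleep(random.uniform(0.0, 0.05))
    # linger on, timeout 0: close() sends RST instead of FIN
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()
```

Run several of these in parallel loops and the server-side accept/read paths get hit in every intermediate state.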
Here's a complete working test case with enhanced exception handling. Note that extra poll flags are included to force the modlwip.c ioctl poll to return the POLL_HUP and POLL_ERR revents, just in case you haven't re-compiled a fixed modlwip.c yet. This makes it "just work" in either case, but it should be considered debug code and removed at some point. The test case below was run on an STM32F767. Prior testing was performed with an STM32F429. Scripting and object coding errors were correctly trapped and reported from all layers. Socket state errors were correctly handled (AFAICT) for floods of connection abandons, resets, and simple SYN attacks. In all cases a browser was able to access the demo webapp GET / info page unless the stack was "jammed", in which case it would wait until the nasty connection flood subsided and then resume. The memory stats reporting isn't high-fidelity, but it does at least show that there isn't any leakage after a lengthy barrage of bad connection behaviors. Compared to where we started, this works for me, and IMHO it also addresses the popular WiFi dropout issue in addition to connection resets (at least from the aspect of not chewing up memory and crashing). If we have any other connection states we need to consider, let me know. As a side note - I noticed WireShark showing a browser (client) sending TCP keep-alive packets, and LWIP (server) appeared to be responding correctly. I know there is an interest in using keep-alives to sustain connections. This is at least a positive indication that some of the mechanisms are in place and functioning. It may just be a matter of the LWIP API exposing more of this within the micropython environment. P.S. I apologise if there are any issues or offense with the copyright clauses or other header fluff - this is just text/example code. |
I'm looking at your uasyncio changes. I appreciate it isn't final code, but two lines look like leftovers: the import sys and the final one. The change otherwise seems benign: I can't see it breaking any existing code, although I appreciate it needs firmware changes to achieve its aim. Please let me know if I'm missing anything. I've also taken a quick look at PicoWeb. I notice that you issue reader.aclose(); as I understand it, this should be yield from reader.aclose(). |
@peterhinch - correct, the sys import related to exception stack trace dumps - debug code that I deleted. And, as I've been saying, the start_server() function has a race condition between its IORead(s) and s.accept that must be trapped at this layer or we end up crashing the entire server loop over spurious network behaviors. Sure, it's a rare event, but it certainly would be a head-scratcher for many users. Finally, the changes in PicoWeb appear simple but are a bit trickier. And, AFAICT, you don't need to yield on closing reader polling objects. Actually, doing so in an exception will result in a dictionary key error because of how the yielding works. I wasn't that smart - I added all the yields and watched it fail. So, the constructs that are in place will properly handle the exceptions we were able to trigger, without internal errors and without hanging the client due to incomplete HTTP exchanges. And it's not perfect yet. After running all night, the enhanced exception trapping code tripped over an unusual exception type and it halted execution. My original approach was to wrap all the errors in this middleware and blame the network, but @pfalcon has raised the bar and we're trying to provide that solution. Traceback (most recent call last): The offending PicoWeb line 210 in the demo code is: I was just being curt when I wrote that line last night. So it's vulnerable to e.args being an empty tuple. I suppose we need to beef this up a bit. Keep in mind that, when we simply pass on all errors here, we never crash. So this can't be a major resource issue and may very well just be another socket-triggered error type that needs to be added to the exception handler here. I have an idea of how to beef this up and am running it now... |
UPDATE: Made this modification in PicoWeb (and a similar e.args length trap in uasyncio start_server). This should show us what rare event is throwing the error with a zero-length e.args tuple. It may take several hours or all day to trip this error. I'll post the results when that happens...
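The length trap reads roughly like this (a sketch using CPython's errno in place of MicroPython's uerrno; classify_exception is an illustrative name, not the actual PicoWeb/uasyncio diff):

```python
import errno

def classify_exception(e):
    """Sort a trapped exception by how the server loop should react."""
    if len(e.args) == 0:
        # The rare argument-less exception: log it and keep serving,
        # so probing e.args[0] can never mask the original traceback.
        return "unknown"
    if e.args[0] in errno.errorcode:
        return "socket"    # ordinary OS/socket error: drop the connection
    return "internal"      # likely a real bug: re-raise so it surfaces
```

The key point is checking len(e.args) before touching e.args[0], so the handler itself cannot throw an IndexError over the original error.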
|
UPDATE:
Now that we're discriminating on errors down in these modules, we're uncovering more structural bugs in uasyncio. The rare error I mentioned is in the uasyncio StreamReader.readline() function, where an IORead(s) followed by a None return on the socket readline results in an assertion failure.
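The distinction the readline() path needs is, in outline (a sketch; classify_read is an illustrative helper, not the actual uasyncio fix): on a non-blocking socket, None means no data yet, b"" means the peer closed, and anything else is payload - so asserting the result is not None is the wrong reaction to a socket that simply isn't ready.

```python
def classify_read(res):
    # Result of a non-blocking socket read/readline:
    if res is None:
        return "pending"   # EAGAIN-style: yield back to the scheduler, retry
    if res == b"":
        return "eof"       # peer closed cleanly: finish the stream
    return "data"          # normal payload: hand it to the caller
```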
|
Apologies if I'm missing something, but I still think issuing reader.aclose() will never close the socket: it will simply instantiate a generator object. The code is:

    def aclose(self):
        yield IOReadDone(self.polls)
        self.ios.close()

Consider the following:

    def foo():
        yield 1
        print('foo')

If I paste this at the REPL and issue foo(), it does not print anything, but merely returns a generator instance. |
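The point is easy to verify (same shape as the foo() example, with an illustrative aclose stand-in): calling a generator function executes none of its body.

```python
import types

def aclose():
    # stand-in for StreamReader.aclose(): the body only runs when iterated
    yield "polls-deregistered"
    return "closed"

g = aclose()                               # instantiates a generator; nothing has run
print(isinstance(g, types.GeneratorType))  # True
print(next(g))                             # only now does the body run to the yield
```

So a bare reader.aclose() creates and discards the generator; yield from reader.aclose() (or exhausting it) is what actually executes the close.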
Understood that it seems odd. But the latest code zip posted has been running without errors for many hours with no apparent memory leaks. And, while the test is running, poking at the server from a browser shows healthy memory stats reported back as a web page. So, my dumb answer is "well, it works so...."
But yes, we need to parse through the code and truly understand what's going on here. If you make it a generator up here, with a yield in front of the IORead(), it will mess with the object map and this results in a "key missing" error. Feel free to insert the yield and watch it fail. But without the reader.aclose, the HTTP client gets hung waiting for the session to complete. The bare reader.aclose() made the server more polite. I haven't bothered to weed through the spaghetti to pin down the exact statement. Right now, I'm looking at boring screens with no reports after hours of running - no reported unusual exceptions, no obvious issues. So, for tonight's entertainment, I've hacked some test code into start_server to show us all the possible exceptions that are being returned when doing a reader.readline() from a stack/socket connection that is being abused by the wire. The "assert res is not None" error being thrown is bothering me and I can't trigger it again, so this test code should make it present itself more easily.
And why would the e.args tuple be coming back "empty"? It may be simpler to dig into the C code and just read it.
Curses to @pfalcon for raising the bar on exception handling in the middleware :-) No worries, with multiple folks looking at this now we'll make it right and proper. micropython/pycopy/fast_io will be all the better for it, IMHO.
|
I can't see that the line reader.aclose() is doing anything useful. It's instantiating a generator and discarding it. Commenting the line out would prove the point one way or the other. |
@peterhinch ...because you're correct. I had a prior test case where the web browser was hanging waiting for the server to finish. Adding these appeared to change the behavior, and the remote client was able to happily complete its page rendering. It's likely that they just slowed micropython down with enough of a delay for more bytes to get out on the wire before the writer.aclose killed the outbound packet in LWIP. I'll have to build back up to that more complex case and see if we can re-create it. Meanwhile, I've commented out the useless reader.aclose()s but will keep in mind that we may want to add a brief delay in the exception handler, or more directly check status, to permit awrite()en data to get out onto the wire before the socket evaporates. That's the best I've got on that at this time. The zip'd project below is where I've ended up after a more careful trace of the exceptions thrown at the various levels and locations. The elusive e.args empty-tuple error was only seen twice and I couldn't determine where it may have come from. So, I've added some traps (and made it non-fatal) so we get more details if it ever happens again. I doubt we're 100% done with this yet, but it's running annoyingly crash-free at the moment, which may sound odd but is not. Note: My primary long-run test at the moment is the socket reset because it's pretty straight-forward. With the default STM32 socket limit of qty 5, I'm running three subprocesses of the sock_test randomly-reset HTTP GETs. I tried adding more data after the HTTP header in the request, but that didn't do anything interesting (i.e. no errors thrown; it only slowed the connection rate down). (This now has more substantial changes plus some added cosmetic alterations for readability.) |
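One concrete shape for the "let written data get out before the socket evaporates" idea (a sketch with blocking CPython sockets; polite_close is my naming, and a coroutine version would yield instead of blocking): shut down the send side so the stack flushes and sends FIN, drain until the peer closes, then close.

```python
import socket

def polite_close(sock, timeout=1.0):
    try:
        sock.shutdown(socket.SHUT_WR)   # flush pending data and send FIN
        sock.settimeout(timeout)
        while sock.recv(512):           # drain until the peer closes too
            pass
    except OSError:
        pass                            # already reset: nothing left to flush
    finally:
        sock.close()
```

This avoids the fixed-delay guess: the drain loop returns as soon as the peer has seen our FIN, so the awrite()en bytes are on the wire before close.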
I've looked at the uasyncio changes. Why the int() wrappers in

    delay = int(time.ticks_diff(t, tnow))

and

    l = int(len(self.runq))

Does the added except Exception as e: ever get called? I appreciate this may be debug code, but if it gets called I would suspect that the underlying cause is an error elsewhere. Re:

    except Exception as e:
        if len(e.args)==0:

Are you getting exceptions without args? If so, I think we need to find out where these are coming from, as they may be indicative of a MicroPython bug. My understanding is that Python exceptions always have an args tuple with at least one element. Unless there is a legitimate case of which I'm unaware? Your zipfiles don't include your changes to modlwip.c. |
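One boundary case worth checking at the REPL: in CPython (and, as far as I can tell, MicroPython), an exception raised without arguments does come back with an empty args tuple (a quick check, not code from the thread):

```python
# raise with no args: args is the empty tuple, so len(e.args) can be 0
try:
    raise OSError            # e.g. an error re-raised without arguments
except OSError as e:
    print(e.args)            # ()
    print(len(e.args))       # 0
```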
@peterhinch - Yes, this is intended to be an example, to explain in code what I didn't do well with words. I'm not trying to contribute to the actual code but to help find critical issues and fixes. Turning this into prose is ideally left to the core team here, and is probably the most effective use of our time overall, since style and opinions can slow things down. Accordingly, I'm not submitting my heavily rewritten version. I agree - I don't believe any required changes were made to core.py.
* The use of the single lower-case letter "l" for variable names was changed in my copy, to avoid it looking like the number "1" in certain fonts. Just no.
* The int() wrapper is left over from something else - I forgot to delete it. It should be removed.
* The fatal exception trap highlighted with ### is the one and only suggested non-bug fix in core.py. This clause is there to help me find some of the really bad errors in co-routines written later. The idea (for myself) is to add a "decoration" to the stack trace that will be printed out further up the stack. This is so I know that a co-routine, being processed by the core.py run_forever loop, caught an exception. It could be prefaced with a log.debug(), but I don't really want the spam of all the other debug messages. So perhaps another log level? Or delete it. I'll keep it in my own copy for use when integrating a lot of co-routines. There are many other log.debug messages in uasyncio that are, frankly, a bit too much chatter and seem suited to use by the original author (i.e. they need to be heavily pruned). Also, if they are to remain, then we'd need a lot more, because not all code paths have the same level of debug messaging in similar functions... But there's my style statement creeping in, so I'll stay out of that - it's not my repository; not my prose.
* Yes, several times yesterday, the test bench threw errors in clauses that tried to determine the length of e.args - stating that the tuple was None (i.e. empty). I was trying to home in on the elusive bug, but with the code and timing alterations it hasn't recurred. This is part of the reason I've rigged all those exception trip-wires and print statements. Unfortunately, the exception thrown by trying to len(e.args) masked the prior exception. Agreed that micropython itself may have erroneously tossed this at us and may do so again. Since it's so rare and needs to be reported, perhaps we should just make it fatal and make the printed message clear about that?
* The modlwip.c diff has been posted as an issue on both pycopy and micropython, as properly as I could and, as you suggested, under micropython as a modlwip.c issue. I've compiled/tested the altered version for STM32F429 and F767 with the desired results. I did change the names of the variables (flags, res) to (events, revents) to match the Linux man pages and other similar systems. I prefer to have strong name/word connections so skimming docs is quicker. The best person for looking at modlwip.c is Damien. I'm not sure why he hasn't picked up the issue, with the diff and a ribbon on it? Nevertheless, the changes are trivial. Because it is an LWIP API and a critical ioctl, I walked through the logic line by line (offline) with someone else before compiling. I'm satisfied that the changes there are sufficiently tested and are merge-worthy; they just need the author's blessing.
* In my parallel but different version, I've added an explicit gc.collect() in the start_server loop. This seems to be the most efficient place to put an explicit collection. The entire memory fragmentation issue with micropython is on my list of major usefulness problems, so this is some initial poking to characterize things. Later, I'll have a look at the memory issues with micropython. The impact on system robustness is severe. I've seen many of the prior discussions and understand that we can't pork up micropython with a lot of memory tracking structures or scrubbing code - not on these tiny IoT platforms. But we need a practical solution. I may have to implement a stability improvement for my current application project. But I'll save discussing my ideas for memory stabilization for a later time and a different thread. Just another puzzle to solve.
* FYI - the zip example ran all night while being loaded with a simultaneous beating from three subprocessed reset attacks and a barrage of scapy socket abandons. I added code to blink the red LED for errors and the green LED for successful web pages served. With those, it's easy to see when the stack is "jammed" and no sockets are available (they'll eventually time out). Even with all that, I could still (with effort) get a browser to find a hole to get out the memory stats. So, with clean-up and promoting of prints to log messages (or deleting them), and with a bit more review, I think the uasyncio and PicoWeb mods are also close. I'd think @pfalcon will want to study the impact of the exception handling, or at least exception sensing, in the middleware to make sure it fits within the goals of the pycopy project.
* In my pocket copy of PicoWeb, I added a caller option in picoweb.run(...., run_loop=True). This defaults to the published behavior, but it lets me turn off the last ~four lines where the task loop is created. In my uses, I have my own tasker.py module that auto-registers tasks from a yaml config file, and the collection of picoweb web-apps is just one of the tasks. I don't know if @pfalcon wants to go that way with his PicoWeb, so I didn't push that tweak here.
* I can't say this enough - it surprises me that so few seem to realize the power of uasyncio and PicoWeb combined on top of micropython/pycopy. And I'm thankful for the author and those who supported bringing these to the mpy world. These two modules make a micropython IoT device many times more useful. I've written similar-effect code for other applications of mpy, but these modules make it far easier to craft co-routined applications. Future work on these could include speed-ups, denser/smaller code, better memory usage - or even moving core.py into C code (since a lot of execution time is spent in run_forever). That would be huge in many ways.
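The explicit-collection point above, as a minimal sketch (the helper name and placement are illustrative; gc.mem_free()/gc.mem_alloc() exist only on MicroPython, so this falls back to a rough CPython proxy):

```python
import gc

def mem_stats():
    # Collect right after a connection is handled, where churn is highest,
    # then report something a stats web page can render.
    gc.collect()
    if hasattr(gc, "mem_free"):                   # MicroPython
        return {"free": gc.mem_free(), "alloc": gc.mem_alloc()}
    return {"objects": len(gc.get_objects())}     # CPython stand-in
```

Calling this once per accepted connection keeps the heap compacted at the one point where every request's allocations have just gone dead.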
|
Thank you for that excellent summary. It would be good to know where in the firmware that strange exception originates, but I appreciate you may not have time to follow this up. Here is my take on what should happen next.

modlwip.c

Regarding the lack of official response to #5172: I know that @dpgeorge and @jimmo are concerned about this and I'm sure they would welcome a PR. Submitting a PR is the way to get your code reviewed and implemented. I suggest reading the code conventions guide, as Damien is keen to maintain consistency. If your changes are minor, compliance will be easy.

uasyncio

I will implement the error trapping in my fast_io uasyncio fork. I'm not sure what to suggest for official micropython-lib. We could submit a PR, but in the past these tended to be ignored. I suggest we await the response to a modlwip PR: on implementation we can see if a uasyncio PR would be welcome and we can raise it. I'd be happy to do this if you'd rather - but only if it seemed likely that it would get attention from the maintainers.

picoweb and pycopy

Paul is the owner/maintainer. Hopefully he will respond to PRs for picoweb, modlwip and uasyncio. It's entirely up to you how or if you pursue this. My interest is solely with official MicroPython. |
@peterhinch - Thank you for the link to the code conventions guide. I'll review it and try to adjust accordingly. Not sure if I'll get to a PR soon on this due to project deadlines. We'll see...
Keep in mind that these changes should probably land all at once, or with modlwip.c last. Otherwise, activating unsolicited HUPs and ERRs will break unprotected uasyncios and PicoWebs. Adding the extra event flags in the uasyncio poller ioctl calls is a benign workaround to decouple the required changes, and it could be commented for removal after modlwip.c is fixed. I'm hoping to get these into all three repositories. Actually, I have never submitted a PR - previously, @dpgeorge worked the patches with me testing. Your advice is well taken; unless you wish to do the PRs yourself, I'll give submitting via PRs a n00b shot. IMHO, it doesn't matter much to me who writes the code - it matters more who's using it. ;-P
P.S. I'm curious about your fast_io branch. Sorry, I didn't include it in the issue postings. It appears you're working the asyncio and other issues there, which are very interesting and useful.
|
@t35tB0t As I mentioned on micropython/micropython-lib#353 (comment), I'm happy to do the work here to make this a PR and get it submitted. Thanks for posting the diff - I think I can take it from here, if you want? |
@jimmo - Certainly, yes! I was waiting until we had a clear path and solid mods before bothering you. Now we do - and there you are. I probably should get to doing PRs myself, but maybe not this month (onerous looming project deadline). As you've said, this needs to be coordinated with each repository's owner(s). And I'm happy to have the help. Tell me if you need anything more than we have here. The code is stable but, as @peterhinch has indicated, will need some debug code trimming plus some conventions editing. I will gladly test the final version(s) as best I can here.
|
I've started work on fast_io and have some questions of detail re:

    # Code omitted
    s2 = None  # ensure s2 exists if an exception fires before the first accept
    s.listen(backlog)
    try:
        while True:
            try:
                if DEBUG and __debug__:
                    log.debug("start_server: Before accept")
                yield IORead(s)
                if DEBUG and __debug__:
                    log.debug("start_server: After iowait")
                s2, client_addr = s.accept()
                s2.setblocking(False)
                if DEBUG and __debug__:
                    log.debug("start_server: After accept: %s", s2)
                extra = {"peername": client_addr}
                # Detach the client_coro: put it on runq
                yield client_coro(StreamReader(s2), StreamWriter(s2, extra))
                s2 = None  # From Paul's code.
            except Exception as e:
                if len(e.args) == 0:
                    # This happens but shouldn't. Firmware bug?
                    # Handle exception as an unexpected unknown error:
                    # collect details here, then try to continue running
                    print('start_server: Unknown error: continuing')
                    sys.print_exception(e)
                elif not uerrno.errorcode.get(e.args[0], False):
                    # Handle exception as internal error: close and terminate
                    # handler (user must trap or crash)
                    print('start_server: Unexpected error: terminating')
                    raise
    finally:  # From Paul's code
        if s2:
            s2.close()
        s.close()

As discussed your [EDIT] |
reply:
1. The StreamReader.readexactly() was not tested - the change from read(n) to readline(n) was a 4AM coding error; the readline() elsewhere was changed to read() for a socket behavior test, but the manual reversion landed in the wrong place. Sorry for being sloppy. I shouldn't code when I'm drifting in and out of sleep... No - that's OK. I just shouldn't impose that upon you.
2. The log DEBUG calls are not related to the needed fixes - we'll want to change these to conform with the libraries as needed.
3. An exception in start_server is probably only happening before the socket is opened/accepted, so it wouldn't need to be closed. That's probably why the testing doesn't crash. But this relies on the callback functions properly closing the sockets, so an extra-sure close here is a good idea, as long as it doesn't throw an unintended exception. The empty-tuple e.args was actually thrown twice in picoweb:handle(), but I haven't been able to re-create it. And the actual exception stack trace was over-written by the error thrown when trying to access e.args[0] in the exception handler. All further attempts to trap this error failed to catch this elusive bug. If I ever have something more definitive, I'll post.
|
Looking at the docs for poll, it looks like we shouldn't be explicitly registering this eventmask. Can I suggest that you just register select.POLLIN and select.POLLOUT as per the original code and verify that POLLHUP and POLLERR still work? |
RE: "Can I suggest that you just register select.POLLIN and select.POLLOUT as per the original code and verify that POLLHUP and POLLERR still work?"
Well... that's the issue, isn't it? modlwip.c as-is absolutely will not ever return unsolicited errors unless you stuff the HUP and ERR flags into the events input. The reason I included the flags in the demo code is so that folks can test the fix to uasyncio without modifying modlwip.c. I've run it both ways. Now that my modlwip.c is fixed, uasyncio works the same either way. But, certainly yes, once the PR to modlwip.c has been merged, uasyncio shouldn't have those extra flags.
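The decoupling workaround, sketched with CPython's select.poll (a pipe stands in for the socket; this illustrates the registration, not the actual uasyncio diff): requesting POLLHUP|POLLERR explicitly is harmless on a compliant poll, and it is the only way to see them on the pre-fix modlwip.c.

```python
import os
import select

EXTRA = select.POLLHUP | select.POLLERR   # drop once modlwip.c is fixed

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN | EXTRA)      # over-requesting is benign

os.write(w, b"x")
fd, revents = p.poll(1000)[0]
print(bool(revents & select.POLLIN))      # normal readiness still reported
os.close(r)
os.close(w)
```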
|
To clarify: My original point, and the basis for *all* of these mods, is that modlwip in main is not POSIX compliant. It will never return unsolicited HUPs and ERRs because of incorrect masking. The work-around to the non-POSIX modlwip is a non-POSIX uasyncio. In this case, two wrongs make it right and there really is no adverse effect whatsoever. For uasyncio to not be broken and still be POSIX compliant, modlwip must be fixed. This is the fundamental point I've been trying to broadcast. I'm trying to think of any good reason to leave modlwip as non-POSIX and use it sometimes with and sometimes without error return events... I can't.
|
OK, thanks for the clarification. I think you are absolutely right about modlwip. |
@jimmo - Do you think you'll be able to get some attention on the recommended modlwip.c modifications and have them merged into micropython and pycopy? @peterhinch's fast_io includes a functional work-around which we'll be testing and using. Kudos to @peterhinch for adapting the debugging notes here and adopting this in the fast_io module. However, due to the severity of these bugs, the community here would benefit from having both uasyncio and modlwip.c formally fixed.
|
Thank you for the vote of confidence :) I agree with the need for action on official MicroPython. You may misunderstand the situation with pycopy, which is an unofficial fork of MicroPython. Paul Sokolovsky (@pfalcon) is the sole maintainer of pycopy and its associated library. The best way to get that fixed is to submit PRs/issues yourself. |
POSIX poll should always return POLLERR and POLLHUP in revents, regardless of whether they were requested in the input events flags. See issues micropython#4290 and micropython#5172.
Thanks @t35tB0t and @peterhinch for the detailed report and discussion.
I agree, it should be POSIX compliant. See #5222 for an attempted fix, which is slightly different to the one submitted by @t35tB0t. |
Note that current uasyncio may actually handle the case of unsolicited POLLERR/POLLHUP being returned from poll because the unix port already has this behaviour. |
@dpgeorge - Many thanks for diverting your attention to the minor but important modlwip.c issues and merging the code changes into main. As a caution to users... please be aware that uasyncio is now vulnerable to unsolicited POLLERR/POLLHUPs being returned from poll, because it is lacking exception handling in critical sections (specifically the start_server loop). Some of the suggested modifications have been incorporated by @peterhinch into the fast_io version of uasyncio. I have been running a self-modified version of uasyncio with the latest modlwip.c with good results. There are similar issues with PicoWeb and possibly other modules that are using uasyncio; all will be affected by the modlwip change, which now promiscuously reports unsolicited socket errors. The test scripts I've posted here, with their randomly varying delays, are useful to expose vulnerabilities up through the exception stack into added modules and user code.
|
extmod/modlwip.c only returns POLL_HUP and POLL_ERR if the events flags specify them. AFAIK this is not POSIX compliant, which specifies that these return events shall be unsolicited; e.g. see http://man7.org/linux/man-pages/man2/poll.2.html
The impact is that uasyncio (which does not set the POLL_HUP and POLL_ERR flags) will not see these failed socket connections, and co-routines will hang and consume memory. Eventually micropython will run out of memory. Adding these flags to uasyncio is a work-around to get the socket errors returned, but it is not the expected behavior of these ioctl polls. The modlwip.c changes below were tested and work when appropriate exception handling is added to uasyncio (specifically in the start_server() loop). Socket errors will also now be returned to the yielding co-routines, where appropriate exception handling, including stream reader.aclose() and writer.aclose() calls, must clean up upon premature connection termination.
extmod_modlwip_c.diff.txt
note: In the diff, (flags, ret) was renamed to (events, revents) to align with common parlance in ioctl polling
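The masking rule under discussion, modeled in Python (a hedged model of what the C ioctl should do, not the actual diff): IN/OUT are gated by the requested events, while HUP/ERR are not.

```python
import select

def poll_revents(events, readable, writable, peer_closed, errored):
    # Model of a POSIX-compliant poll ioctl result.
    revents = 0
    if (events & select.POLLIN) and readable:
        revents |= select.POLLIN
    if (events & select.POLLOUT) and writable:
        revents |= select.POLLOUT
    # POSIX: error/hangup are reported regardless of the events mask
    if peer_closed:
        revents |= select.POLLHUP
    if errored:
        revents |= select.POLLERR
    return revents
```

The pre-fix modlwip.c effectively applied the events mask to all four bits, which is why HUP/ERR never surfaced unless explicitly requested.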