
modlwip.c does not properly return POLL_HUP and POLL_ERR socket errors #5172

Closed
t35tB0t opened this issue Oct 2, 2019 · 31 comments

@t35tB0t

t35tB0t commented Oct 2, 2019

extmod/modlwip.c only returns POLL_HUP and POLL_ERR if the events flags explicitly request them. AFAIK this is not POSIX compliant: POSIX specifies that these return events shall be unsolicited, see e.g. http://man7.org/linux/man-pages/man2/poll.2.html
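
To illustrate the POSIX behavior in question, here is a minimal CPython/unix sketch (not the lwIP port): only POLLIN is registered, yet a connection that gets reset still reports POLLERR/POLLHUP in revents, because those events are unsolicited.

    import select
    import socket
    import struct

    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    cli = socket.create_connection(srv.getsockname())
    peer, _ = srv.accept()

    poller = select.poll()
    poller.register(peer, select.POLLIN)  # POLLHUP/POLLERR deliberately NOT requested

    # Close with SO_LINGER (on, timeout 0) so the kernel sends a RST, simulating a dropped peer.
    cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    cli.close()

    for fd, revents in poller.poll(1000):
        # On Linux this typically reports POLLERR and POLLHUP (plus POLLIN) even
        # though only POLLIN was asked for - the behavior modlwip.c currently lacks.
        print(hex(revents), bool(revents & select.POLLERR), bool(revents & select.POLLHUP))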

The impact is that uasyncio (which does not set the POLL_HUP and POLL_ERR flags) never sees these failed socket connections, so coroutines hang and consume memory; eventually MicroPython runs out of memory. Adding these flags in uasyncio is a workaround that gets the socket errors returned, but it is not the expected behavior of these ioctl polls. The modlwip.c changes below were tested and work when appropriate exception handling is added to uasyncio (specifically in the start_server() loop). Socket errors are now also returned to the yielding coroutine, where appropriate exception handling, including reader.aclose() and writer.aclose(), must be used to clean up after a premature connection termination.

extmod_modlwip_c.diff.txt

Note: in the diff, (flags, ret) are renamed to (events, revents) to align with common parlance for ioctl polling.

@peterhinch
Contributor

Compare #4290

@t35tB0t
Author

t35tB0t commented Oct 4, 2019

@peterhinch - Kudos to you for finding this issue way back in #4290 and documenting it better than I did here. I'm very certain it's the very same issue, and it is fixed for STM32 with the mods made to the related modules; I have not dug into the ESP code base to understand the related behaviors. However, I'm not sure I agree with the expected error returns. Reading the related Linux man page, it is pretty clear that revents shall NOT simply echo the caller's event flags, so we shouldn't be getting POLL_IN + POLL_HUP (17) returned. If we are, then the ESP stacks are also broken in a non-compliant way. Of course we can deal with that in the upper layers, but it isn't a good direction to be moving in. One thing to try with the ESPs is simply adding the POLL_HUP | POLL_ERR flags to the uasyncio poll read and write calls and seeing how that changes your results. This is a socket ioctl API issue that may be impacting multiple targets. However, the lack of error handling in uasyncio's start_server loop seems to be a global issue. Yes, one can trap that error higher up, but crashing and restarting a server on every connection dropped before accept is not OK.

As soon as I have a break here, I will clean up all the excess debug notes in the related files and post a complete working example which is, AFAIK, tested and ready for merge consideration.

@peterhinch
Contributor

As a very general comment, the testing @kevinkk525 and I performed was to determine the response to dropped connections and, critically, to WiFi dropouts; the latter are commonplace. Any updates should be tested against these conditions in addition to testing against malicious peers.

It's possible that uasyncio.start_server is only used by Picoweb. We avoid it because of the fragility mentioned above.

@t35tB0t
Author

t35tB0t commented Oct 6, 2019

After a few tweaks there is nothing wrong with uasyncio.start_server; it is no longer fragile.

Here are some simple testing tools crafted to generate abandoned connections, reset connections, etc. The sock_test.py script should be launched as multiple subprocesses; this generates a flood of overlapping, reset HTTP connections. With randomized delays and multiple generators, all of the code race conditions are eventually hit. To simulate dropped WiFi connections, the scapy library is used for its lower-level packet manipulation. The command-line launch info is commented in the scapy-cAbandon.py script. One of the options provides a simple SYN flood attack, but I suggest using the -v True mode for our purposes here.
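
For readers without the attachments, here is a minimal, hypothetical stand-in for the reset-connection generator (not the attached sock_test.py itself; reset_get and the target address are illustrative assumptions): it sends a partial HTTP GET and then closes with SO_LINGER set to zero so the kernel emits a RST instead of a FIN.

    import random
    import socket
    import struct
    import time

    TARGET = ("192.168.1.50", 80)  # hypothetical device address - adjust to your board

    def reset_get(host_port):
        s = socket.create_connection(host_port, timeout=2)
        # Closing with SO_LINGER (on, timeout 0) makes the kernel send RST, not FIN.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        s.sendall(b"GET / HTTP/1.1\r\nHost: test\r\n\r\n")
        time.sleep(random.uniform(0.0, 0.2))  # randomized delay to hit race windows
        s.close()                             # connection is reset mid-exchange

    while True:
        try:
            reset_get(TARGET)
        except OSError:
            pass  # the server may refuse or limit connections under load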

scapy-cAbandon.zip
socket_test.zip

@t35tB0t
Author

t35tB0t commented Oct 6, 2019

Here's a complete working test case with enhanced exception handling. Note that extra poll flags are included to force the modlwip.c ioctl poll to return the POLL_HUP and POLL_ERR revents, in case you haven't recompiled a fixed modlwip.c yet. This makes it "just work" in either case, but it should be considered debug code and removed at some point. The test case below was run on an STM32F767; prior testing was performed on an STM32F429. Scripting and object coding errors were correctly trapped and reported from all layers. Socket state errors were correctly handled (AFAICT) for floods of connection abandons, resets, and simple SYN attacks. In all cases the browser was able to access the demo webapp GET / info page unless the stack was "jammed", in which case it would wait until the flood of nasty connections subsided and then resume. The memory stats reporting isn't high-fidelity, but it does at least show that there isn't any leakage after a lengthy barrage of bad connection behaviors. Compared to where we started, this works for me, and IMHO it also addresses the popular WiFi dropout issue in addition to connection resets (at least from the aspect of not chewing up memory and crashing). If there are any other connection states we need to consider, let me know.

As a side note - I noticed Wireshark showing a browser (client) sending TCP keep-alive packets, and lwIP (server) appeared to be responding correctly. I know there is interest in using keep-alives to sustain connections, so this is at least a positive indication that some of the mechanisms are in place and functioning. It may just be a matter of the lwIP API exposing more of this within the MicroPython environment?
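
For reference, this is how keep-alive is enabled on a standard POSIX socket (CPython or the unix port); whether modlwip exposes SO_KEEPALIVE and the related lwIP tuning knobs is exactly the open question raised above, so treat this as a sketch of the desired API rather than something known to work on an lwIP port today.

    import socket

    s = socket.socket()
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The probe tuning below is Linux-specific and optional.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle seconds before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before drop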

PYBFLASH_20191006.zip

P.S. I apologise if there are any issues or offense with the copyright clauses or other header fluff - this is just text/example code.

@peterhinch
Contributor

I'm looking at your uasyncio changes with a view to implementing them in my fast_io fork. The only non-debug change seems to be in __init__.py in the start_server() function.

I appreciate it isn't final code but the following two lines in __init__.py seem to be redundant:

import sys

and the final pass statement in start_server.

The change seems benign: I can't see it breaking any existing code, although I appreciate it needs firmware changes to achieve its aim.

Please let me know if I'm missing anything.

I've also taken a quick look at PicoWeb. I notice that you issue

reader.aclose()

As I understand it StreamReader.aclose() is asynchronous so I'd expect to see

yield from reader.aclose()

@t35tB0t
Author

t35tB0t commented Oct 6, 2019

@peterhinch - correct, the sys import related to exception stack trace dump debug code that I have since deleted. And, as I've been saying, the start_server() function has a race condition between its IORead(s) and s.accept() that must be trapped at this layer, or we end up crashing the entire server loop over spurious network behaviors. Sure, it's a rare event, but it would certainly be a head-scratcher for many users. Finally, the changes in PicoWeb appear simple but are a bit trickier.

And, AFAICT, you don't need to yield on closing reader polling objects. Actually, doing so in an exception handler will result in a dictionary key error because of how the yielding works. I wasn't that smart - I added all the yields and watched it fail. So, the constructs that are in place properly handle the exceptions we were able to trigger, without internal errors and without hanging the client due to incomplete HTTP exchanges.

And it's not perfect yet. After running all night, the enhanced exception trapping code tripped over an unusual exception type and halted execution. My original approach was to wrap all the errors in this middleware and blame the network, but @pfalcon has raised the bar and we're trying to provide that solution.

Traceback (most recent call last):
File "main.py", line 18, in <module>
File "web.py", line 28, in run
File "picoweb/__init__.py", line 309, in run
File "uasyncio/core.py", line 109, in run_forever
File "picoweb/__init__.py", line 210, in _handle
IndexError: tuple index out of range
MicroPython v1.11-292-g5641aa55d-dirty on 2019-09-25; NUCLEO-F767ZI with STM32F767
Type "help()" for more information.

The offending PicoWeb Line 210 in the demo code is:
if not uerrno.errorcode.get(e.args[0],False):

I was just being terse when I wrote that line last night, so it's vulnerable to e.args being an empty tuple. I suppose we need to beef this up a bit. Keep in mind that when we simply pass on all errors here, we never crash, so this can't be a major resource issue; it may very well just be another socket-triggered error type that needs to be added to the exception handler here. I have an idea of how to beef this up and am running it now...

@t35tB0t
Author

t35tB0t commented Oct 6, 2019

UPDATE: I made this modification in PicoWeb (and a similar e.args length trap in uasyncio's start_server). This should show us what rare event is throwing the error with a zero-length e.args tuple. It may take several hours or all day to trip this error; I'll post the results when that happens...

    except Exception as e:
        if len(e.args)==0:
            # Handle exception as an unexpected unknown error: collect details here, then close and try to continue running
            sys.print_exception(e)
            reader.aclose()
            yield from writer.aclose()
            raise
        elif not uerrno.errorcode.get(e.args[0],False):
            # Handle exception as internal error: close and terminate handler (user must trap or crash)
            reader.aclose()
            yield from writer.aclose()
            raise
        else:
            # Handle exception as bad socket state: close and continue
            close=True
            if self.debug:
                self.log.exc(e, "%.3f %s %s %r" % (utime.time(), req, writer, e))

@t35tB0t
Author

t35tB0t commented Oct 6, 2019 via email

@peterhinch
Contributor

Apologies if I'm missing something but I still think issuing

reader.aclose()

will never close the socket: it will simply instantiate a generator object. The code is:

    def aclose(self):
        yield IOReadDone(self.polls)
        self.ios.close()

Consider the following:

def foo():
    yield 1
    print('foo')

If I paste this at the REPL and issue

>>> foo()
<generator object 'foo' at 7f9b346d69a0>
>>> 

it does not print anything, but merely returns a generator instance.

@t35tB0t
Author

t35tB0t commented Oct 7, 2019 via email

@peterhinch
Contributor

I can't see that the line

reader.aclose()

is doing anything useful. It's instantiating a generator and discarding it. Commenting the line out would prove the point one way or the other.

@t35tB0t
Author

t35tB0t commented Oct 7, 2019

@peterhinch ...because you're correct. I had a prior test case where the web browser was hanging, waiting for the server to finish. Adding these calls appeared to change the behavior, and the remote client was able to happily complete its page rendering. It's likely that they just slowed MicroPython down with enough of a delay for more bytes to get out on the wire before writer.aclose() killed the outbound packet in lwIP. I'll have to build back up to that more complex case and see if we can re-create it. Meanwhile, I've commented out the useless reader.aclose() calls, but will keep in mind that we may want to add a brief delay in the exception handler, or more directly check status, to let data written with awrite() get out onto the wire before the socket evaporates. That's the best I've got on that at this time.
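
One hypothetical way to give awrite() data a chance to drain before teardown, assuming a short cooperative sleep is acceptable (the finish() helper name is illustrative, and whether this is the right fix depends on how lwIP flushes on close):

    import uasyncio as asyncio

    def finish(writer):
        # Generator-style coroutine for uasyncio v2: brief drain delay, then close.
        yield from asyncio.sleep(0.05)   # tunable; gives queued bytes time to leave
        yield from writer.aclose()       # aclose() is a coroutine, so it must be yielded from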

The zipped project below is where I've ended up after a more careful trace of the exceptions thrown at the various levels and locations. The elusive empty e.args tuple error was only seen twice and I couldn't determine where it came from, so I've added some traps (and made it non-fatal) so we get more details whenever it happens again. I doubt we're 100% done with this yet, but it's running annoyingly crash-free at the moment - which may sound odd but is not.

Note: My primary long-run test at the moment is the socket reset, because it's pretty straightforward. With the default STM32 socket limit of 5, I'm running three subprocesses of the socket_test randomly-reset HTTP GETs. I tried adding more data after the HTTP header in the request, but that didn't do anything interesting (i.e. no errors thrown; it only slowed the connection rate down).

PYBFLASH_20191007.zip

(this now has more substantial changes plus some added cosmetic alterations for readability)

@peterhinch
Contributor

peterhinch commented Oct 7, 2019

I've looked at the uasyncio changes.
Re core.py: I'd be surprised if there's a significant bug there - it has been extensively tested. The added int() statements seem to do nothing, as the quantities to which they apply are already integers:

                delay = int(time.ticks_diff(t, tnow))

and

l = int(len(self.runq))

Does the added

except Exception as e:

ever get called? I appreciate this may be debug code but if it gets called I would suspect that the underlying cause is an error in __init__.py. If this code is to be permanent, we need to be sure it won't affect normal exception handling in user coroutines.

Re __init__.py You're clearly making good progress in addressing the socket handling.

In start_server you have:

        except Exception as e:
            if len(e.args)==0:

Are you getting exceptions without args? If so I think we need to find out where these are coming from as they may be indicative of a MicroPython bug. My understanding is that Python exceptions always have an args tuple with at least one element. Unless there is a legitimate case of which I'm unaware?
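
For what it's worth, a bare exception raised without arguments does have an empty args tuple in both CPython and MicroPython, so there is at least one legitimate way e.args[0] can fail:

    try:
        raise OSError()          # no errno supplied
    except Exception as e:
        print(e.args)            # prints: ()
        print(len(e.args) == 0)  # prints: True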

Your zipfiles don't include your changes to extmod/modlwip.c. I'm not the best person to review this, but maybe someone with appropriate experience would take a look. When you have a version which you regard as final I could test for breakages to my existing libraries.

@t35tB0t
Author

t35tB0t commented Oct 7, 2019 via email

@peterhinch
Contributor

Thank you for that excellent summary. It would be good to know where in the firmware that strange exception originates, but I appreciate you may not have time to follow this up. Here is my take on what should happen next.

modlwip.c

Regarding the lack of official response to #5172 I know that @dpgeorge and @jimmo are concerned about this and I'm sure they would welcome a PR.

Submitting a PR is the way to get your code reviewed and implemented. I suggest reading the code conventions guide as Damien is keen to maintain consistency. If your changes are minor compliance will be easy.

uasyncio

I will implement the error trapping in my fast_io uasyncio fork.

I'm not sure what to suggest for official micropython-lib. We could submit a PR, but in the past these tended to be ignored. I suggest we await the response to a modlwip PR: on implementation we can see if a uasyncio PR would be welcome and we can raise it. I'd be happy to do this if you'd rather - but only if it seemed likely that it would get attention from the maintainers.

picoweb and pycopy

Paul is the owner/maintainer. Hopefully he will respond to PRs for picoweb, modlwip and uasyncio. It's entirely up to you how or if you pursue this. My interest is solely with official MicroPython.

@t35tB0t
Author

t35tB0t commented Oct 8, 2019 via email

@jimmo
Member

jimmo commented Oct 8, 2019

@t35tB0t As I mentioned on micropython/micropython-lib#353 (comment) I'm happy to do the work here to make this a PR and get it submitted. Thanks for posting the diff, I think I can take it from here if you want?

@t35tB0t
Author

t35tB0t commented Oct 8, 2019 via email

@peterhinch
Contributor

peterhinch commented Oct 8, 2019

I've started work on fast_io and have some questions of detail re __init__.py:

  1. In StreamReader.readexactly you replaced self.ios.read(n) with self.ios.readline(n). In general data read by readexactly may not be line-oriented, especially as stream devices may not be sockets. What is the reason for this change?
  2. You have calls to log.DEBUG. On a Pyboard a logger instance has a debug method but no DEBUG. I have used the lowercase method.
  3. In the event of an exception your start_server method doesn't close the sockets. Paul's code does. I currently have the following, adapted from Paul's code and yours. Please let me know your thoughts.
    # Code omitted
    s.listen(backlog)
    try:
        while True:
            try:
                if DEBUG and __debug__:
                    log.debug("start_server: Before accept")
                yield IORead(s)
                if DEBUG and __debug__:
                    log.debug("start_server: After iowait")
                s2, client_addr = s.accept()
                s2.setblocking(False)
                if DEBUG and __debug__:
                    log.debug("start_server: After accept: %s", s2)
                extra = {"peername": client_addr}
                # Detach the client_coro: put it on runq
                yield client_coro(StreamReader(s2), StreamWriter(s2, extra))
                s2 = None  # From Paul's code.

            except Exception as e:
                if len(e.args)==0:
                    # This happens but shouldn't. Firmware bug?
                    # Handle exception as an unexpected unknown error:
                    # collect details here then close try to continue running
                    print('start_server:Unknown error: continuing')
                    sys.print_exception(e)
                elif not uerrno.errorcode.get(e.args[0], False):
                    # Handle exception as internal error: close and terminate
                    # handler (user must trap or crash)
                    print('start_server:Unexpected error: terminating')
                    raise
    finally:  # From Paul's code
        if s2:
            s2.close()
        s.close()

As discussed your start_server exception handler implies you're getting odd errors which may indicate a firmware issue. Do you think it would help the developers to post a log of what you're getting?

[EDIT]
I have posted code here.
I appreciate you may not have time to look at this but any comments would be welcome. core.py is unchanged (from my fast_io version) apart from the version number. Changes to __init__.py are as discussed. I have done basic testing with a UART and with a version of PicoWeb.

@t35tB0t
Author

t35tB0t commented Oct 9, 2019 via email

@peterhinch
Contributor

Looking at the docs for poll.register I see

Note that flags like uselect.POLLHUP and uselect.POLLERR are not valid as input eventmask (these are unsolicited events which will be returned from poll() regardless of whether they are asked for). This semantics is per POSIX.

It looks like we shouldn't be explicitly registering this eventmask.

Can I suggest that you just register select.POLLIN and select.POLLOUT as per the original code and verify that POLLHUP and POLLERR still work?

@t35tB0t
Author

t35tB0t commented Oct 9, 2019 via email

@t35tB0t
Author

t35tB0t commented Oct 9, 2019 via email

@peterhinch
Contributor

OK, thanks for the clarification. I think you are absolutely right about modlwip.

@t35tB0t
Author

t35tB0t commented Oct 13, 2019 via email

@peterhinch
Contributor

Thank you for the vote of confidence :) I agree with the need for action on official MicroPython.

You may misunderstand the situation with pycopy, which is an unofficial fork of MicroPython. Paul Sokolovsky (@pfalcon) is the sole maintainer of pycopy and its associated library. The best way to get that fixed is to submit PRs/issues yourself.

dpgeorge added a commit to dpgeorge/micropython that referenced this issue Oct 16, 2019
POSIX poll should always return POLLERR and POLLHUP in revents, regardless
of whether they were requested in the input events flags.

See issues micropython#4290 and micropython#5172.
@dpgeorge
Member

Thanks @t35tB0t and @peterhinch for the detailed report and discussion.

extmod/modlwip.c only returns POLL_HUP and POLL_ERR if the events flags explicitly request them. AFAIK this is not POSIX compliant: POSIX specifies that these return events shall be unsolicited.

I agree, it should be POSIX compliant. See #5222 for an attempted fix which is slightly different to the one submitted by @t35tB0t .

dpgeorge added a commit that referenced this issue Oct 31, 2019
POSIX poll should always return POLLERR and POLLHUP in revents, regardless
of whether they were requested in the input events flags.

See issues #4290 and #5172.
dpgeorge added a commit that referenced this issue Oct 31, 2019
POSIX poll should always return POLLERR and POLLHUP in revents, regardless
of whether they were requested in the input events flags.

See issues #4290 and #5172.
@dpgeorge
Member

I'll close this issue because the original bug is fixed: POLLHUP and POLLERR (and now POLLNVAL) are unconditionally returned in socket polling. See 71401d5, feaa251 and 26d8fd2

@dpgeorge
Member

Note that current uasyncio may actually handle the case of unsolicited POLLERR/POLLHUP being returned from poll because the unix port already has this behaviour.
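
For context, a simplified sketch (not the actual uasyncio source; the names wait_once, objmap and runq are illustrative) of what that handling amounts to on the scheduler side: when poll reports unsolicited POLLERR/POLLHUP, the waiting task is woken anyway, and its next read()/accept() then raises an OSError it can handle instead of blocking forever on a dead connection.

    import uselect

    def wait_once(poller, objmap, runq, timeout_ms):
        # objmap maps a socket's id to the task waiting on it; runq is the run queue.
        for sock, revents in poller.ipoll(timeout_ms):
            if revents & (uselect.POLLERR | uselect.POLLHUP):
                runq.append(objmap[id(sock)])   # wake the task even though it only
                continue                        # registered POLLIN/POLLOUT
            if revents & (uselect.POLLIN | uselect.POLLOUT):
                runq.append(objmap[id(sock)])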

@t35tB0t
Author

t35tB0t commented Nov 1, 2019 via email
