
extmod: add new implementation of uasyncio #5332

Merged
merged 21 commits into micropython:master from extmod-uasyncio on Mar 25, 2020

Conversation

dpgeorge
Member

This PR adds a completely new implementation of the uasyncio module. The aim of this version (compared to the original one in micropython-lib) is to be more compatible with CPython's asyncio module, so that one can more easily write code that runs under both MicroPython and CPython (and reuse CPython asyncio libraries, follow CPython asyncio tutorials, etc). Async code is not easy to write and any knowledge users already have from CPython asyncio should transfer to uasyncio without effort, and vice versa.

The implementation here attempts to provide good compatibility with CPython's asyncio while still being "micro" enough to run where MicroPython runs. This follows the general philosophy of MicroPython itself, to make it feel like Python.

The existing uasyncio at micropython-lib has its merits and will remain as an independent module/library, but would need to be renamed so as not to clash with the new implementation here. Note that the implementation in this PR provides a compatibility layer that is (for the most part) compatible with the original uasyncio.

It's currently implemented in pure Python and runs under existing, unmodified MicroPython (there's a commit in this PR to improve allocation of iterator buffers but that is not needed for uasyncio to work). In the future parts of this implementation could be moved to C to improve speed and reduce memory usage. But it would be good to maintain a pure-Python version as a reference version.

At this point efficiency is not a goal, rather correctness is. Tests are included in this PR.
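As a small illustration of that goal, here is a minimal sketch (not code from this PR) of a program that runs unchanged under both the new uasyncio and CPython's asyncio:

try:
    import uasyncio as asyncio  # MicroPython
except ImportError:
    import asyncio              # CPython

async def main():
    print('start')
    await asyncio.sleep(1)
    print('end')

asyncio.run(main())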

Thanks to @peterhinch and @kevinkk525 for help with initial testing and bug finding.

@dpgeorge
Member Author

Features available in this version (so far); a short usage sketch follows the list:

  • basic task creation: create_task() and run() functions
  • sleep(), plus the sleep_ms() extension
  • Task objects, ability to cancel and await on tasks
  • Lock and Event classes
  • gather() and wait_for() functions
  • TCP client and server via open_connection() and start_server()
  • supports duplex streams
  • does not allocate any heap in the core event loop, including sleep() calls
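A minimal sketch exercising several of the features listed above (an illustration only, not code from this PR; the exact exception raised by wait_for() on timeout is assumed to be asyncio.TimeoutError, as in CPython):

import uasyncio as asyncio

async def waiter(evt):
    await evt.wait()                  # Event: block until set
    print('event received')

async def job(n):
    await asyncio.sleep_ms(100 * n)   # sleep_ms() extension
    return n

async def main():
    evt = asyncio.Event()
    asyncio.create_task(waiter(evt))  # basic task creation
    evt.set()                         # wake the waiter
    print(await asyncio.gather(job(1), job(2)))  # prints [1, 2]
    try:
        await asyncio.wait_for(asyncio.sleep(10), 0.1)
    except asyncio.TimeoutError:
        print('timed out')

asyncio.run(main())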

@dpgeorge
Member Author

In terms of total code size, it's currently smaller than the original uasyncio.

On a PYBv1.0 with uasyncio as a frozen module (mpy-cross optimisation level 3, no debugging), the firmware size taken by the frozen module is:

  • original uasyncio: 9432 bytes
  • this new version: 8100 bytes (with pend-throw and utimeq still enabled)
  • this new version: 7204 bytes (with pend-throw and utimeq disabled)

Note that pend-throw and utimeq are only needed for the original uasyncio, so a fair comparison is to exclude them with the new version.

Eventually parts of this new uasyncio could be rewritten in C, but it's not clear if that would increase or decrease size.

@dpgeorge
Member Author

In terms of base memory usage, I tested the following simple program, which runs on both the original and the new uasyncio:

import uasyncio as asyncio

async def main():
    print('start')
    await asyncio.sleep(1)
    print('end')

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

On the unix port, x86 64bit, the minimum heap needed to run this program (with uasyncio frozen) is:

  • original uasyncio: -X heapsize=11280
  • this new version: -X heapsize=8180

So this new version has lower requirements for heap RAM for the basic event scheduler.

@dpgeorge
Member Author

In terms of raw task-scheduling performance, I tested with a slightly modified version of @peterhinch's rate.py found at https://github.com/peterhinch/micropython-async/blob/master/benchmarks/rate.py (yield replaced with await asyncio.sleep_ms(0)). This tests the time taken to switch from one task/coro to another.

Tests run on PYBv1.0 with frozen uasyncio.
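For reference, the core of such a task-switching benchmark looks roughly like this (a simplified sketch, not the actual rate.py; the coro count and duration are arbitrary):

import time
import uasyncio as asyncio

NUM_COROS = 100
count = 0

async def switcher():
    global count
    while True:
        await asyncio.sleep_ms(0)   # yield straight back to the scheduler
        count += 1

async def main(duration_ms=2000):
    for _ in range(NUM_COROS):
        asyncio.create_task(switcher())
    t0 = time.ticks_ms()
    await asyncio.sleep_ms(duration_ms)
    dt = time.ticks_diff(time.ticks_ms(), t0)
    print('Coros', NUM_COROS, 'Iterations/sec', count * 1000 // dt)

asyncio.run(main())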

For the original uasyncio, it could create up to 1350 tasks before it ran out of memory. The time taken for switching versus number of tasks/coros was:

Coros   10  Iterations/sec  4660  Duration 214us
Coros   50  Iterations/sec  4855  Duration 205us
Coros  100  Iterations/sec  4875  Duration 205us
Coros  200  Iterations/sec  4950  Duration 202us
Coros  500  Iterations/sec  5100  Duration 196us
Coros 1000  Iterations/sec  5250  Duration 190us
Coros 1350  Iterations/sec  5900  Duration 169us

(So that's about 200us to switch task.)

For this new version, it could create up to 1000 tasks before it ran out of memory. Timing results:

Coros   10  Iterations/sec  3070  Duration 325us
Coros   50  Iterations/sec  3065  Duration 326us
Coros  100  Iterations/sec  3071  Duration 325us
Coros  200  Iterations/sec  3055  Duration 327us
Coros  500  Iterations/sec  3009  Duration 332us
Coros 1000  Iterations/sec  2877  Duration 347us

So this new version takes a bit more RAM per task (about 30% more), and a bit more time to switch (about 65% more). But keep in mind that this new version is written in pure Python while the original uasyncio has the queuing routines written in C. Parts of this new version can be rewritten in C to reduce RAM usage per task and improve scheduling speed.

@dpgeorge
Member Author

I also tested performance using ApacheBench (ab), with ab -n10000 -c100 ..., so 10k requests total with 100 concurrent ones.

  • original uasyncio does 21000 requests per second, maximum request took 9ms
  • this new version does 19500 requests per second, maximum request took 7ms

It's worth noting that with the original uasyncio, if more requests come in than it is preconfigured to handle (via the fixed runq/waitq lengths) then it fails with an IndexError due to queue overflow. With this new version that doesn't happen because the schedule queue is a linked list.

@dpgeorge
Member Author

The new uasyncio in this PR has a few compatibility functions/classes/methods, which means that it can run most existing uasyncio apps (as long as they only use public API functions), including the latest picoweb version unmodified (tested with example_webapp.py).

@dpgeorge
Member Author

The version here would fix issues raised in #5242 and #5276 (cancellation of tasks), and I also did some basic testing related to #5172 (POLLHUP and POLLERR handling) and it seems to work correctly there.

@dxxb
Contributor

dxxb commented Nov 15, 2019

Great work, thanks to everyone involved.

@peterhinch
Contributor

This script fails when cancelling a task which has run to completion. It formerly worked. Tested on a Pyboard 1.1.

import uasyncio as asyncio

async def test():
    print("test")
    await asyncio.sleep(0.1)  # Works if value is 5
    print('test2')

async def main():
    t = asyncio.create_task(test())
    await asyncio.sleep(0.5)
    print('gh')
    t.cancel()
    await asyncio.sleep(1)
    print('done')

asyncio.run(main())

Outcome:

test
test2
gh
task raised exception: None
Traceback (most recent call last):
  File "uasyncio.py", line 456, in run_until_complete
AttributeError: 'NoneType' object has no attribute 'throw'
done
>>> 

@dpgeorge
Member Author

With some initial optimisation, rewriting the Queue and Task classes in C, the total code size (Python+C) increases by about 160 bytes on stm32, and the rate.py benchmark on PYBv1.0 gives:

Coros   10  Iterations/sec  6265  Duration 159us
Coros   50  Iterations/sec  6273  Duration 159us
Coros  100  Iterations/sec  6294  Duration 158us
Coros  200  Iterations/sec  6323  Duration 158us
Coros  500  Iterations/sec  6393  Duration 156us
Coros 1000  Iterations/sec  6560  Duration 152us
Coros 1500  Iterations/sec  6331  Duration 157us

(So can do more tasks at once, and faster switching, than the original uasyncio.)

Testing with ApacheBench gives about 25500 requests per second, maximum request time 6ms.

This optimised code is not pushed here, just proof of concept that it's possible to optimise using C.

@dpgeorge
Member Author

This script fails when cancelling a task which has run to completion.

Thanks, can confirm. Should be fixed by latest commit.

@peterhinch
Contributor

@dpgeorge YHM re a possible solution to the I/O priority issue.

@hoihu
Sponsor Contributor

hoihu commented Nov 15, 2019

Thanks, really looking forward to it!

We have an application where we handle several UART streams and it's beneficial if they can be served quickly (e.g. for the USB VCP, but also for internal UARTs). Any other ideas about introducing some basic priority handling (e.g. as in Peter's fast_io fork)?

@peterhinch
Contributor

@dpgeorge @hoihu I have posted a version of the new uasyncio which supports fast scheduling. The code is here and a test script may be found here.

This solution improves on the one in my fast_io fork because the fast I/O option is specified on a per-stream basis rather than globally. This was in response to a suggestion from @dpgeorge, who pointed out that a fast stream which did not require priority scheduling (such as a fast socket) could hog the scheduler.

Fast I/O is specified by means of Stream.priority(v=True). In the case of bidirectional streams, the priority value applies in both directions.
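For illustration, usage of that per-stream API might look like this (a hypothetical sketch based on the description above; the UART constructor arguments are placeholders, and priority() is the fork's API rather than part of this PR):

import uasyncio as asyncio
from machine import UART

uart = UART(4, 115200)
sreader = asyncio.StreamReader(uart)
sreader.priority(True)   # request fast (priority) scheduling of this stream's I/O

async def reader():
    while True:
        data = await sreader.read(64)
        print('received', data)

asyncio.run(reader())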

@dpgeorge Feel free to adapt any of this code as you see fit.

@kevinkk525
Contributor

I've been using this new uasyncio implementation in my MQTT-connected application (using mqtt_as with the new Lock implementation and frequent task cancellation) for a few days now and it works superbly.

The only bugs I know of at the moment are rare scenarios with wait_for and gather, but I guess that nobody will encounter those any time soon :)

@nevercast
Contributor

nevercast commented Nov 20, 2019

Will there be await on Pin coming soon, or custom signals that can be awaited so that interrupts can be used without a while loop?

Similar to Event https://docs.python.org/3/library/asyncio-sync.html#asyncio.Event

@peterhinch
Contributor

@dpgeorge You might like to look at this. I made a few changes and provided synchronisation primitives adapted to use your new version efficiently.

uasyncio is implemented as a Python package. This requires primitives to be explicitly imported: while not CPython compatible it does make for significant RAM savings.

The Queue class is adapted from Paul's code, and the Lock class is Kevin Köck's solution, included for completeness and so my tests can run.

@dpgeorge
Member Author

Will there be await on Pin coming soon, or custom signals that can be awaited so that interrupts can be used without a while loop?

Event is already provided, so it can be used to signal custom events like a pin change, although it might need some slight improvements so that Event.set() can be called from a (soft) interrupt handler.

Direct waiting on a pin, like await pin, could be implemented (maybe as await pin.wait_high() for example) but it'd need some thought/design as to how it behaves. In what way would you use this feature?
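A hedged sketch of the Event-based pattern described above (the pin name and trigger are pyboard-style placeholders; note the caveat that calling Event.set() from a soft IRQ may need further work):

import uasyncio as asyncio
from machine import Pin

pin_event = asyncio.Event()

def on_pin_change(pin):
    # IRQ handler: just flag the event, keep ISR work minimal
    pin_event.set()

button = Pin('X1', Pin.IN, Pin.PULL_UP)
button.irq(trigger=Pin.IRQ_FALLING, handler=on_pin_change)

async def handle_pin():
    while True:
        await pin_event.wait()
        pin_event.clear()
        print('pin changed')

asyncio.run(handle_pin())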

@t35tB0t

t35tB0t commented Nov 22, 2019

@dpgeorge - As a note of caution - the new uasyncio may have some of the same bad socket error behaviors as the prior versions. I haven't yet fully reviewed and tested the new uasyncio here, but consider the following in the socket server code... (and please excuse me if I'm just mis-reading the commits here)...

Any s.accept() call that follows a yielded wait for a client connection may throw an exception. This is because the client connection may be aborted or otherwise dropped in the thin time slice between the connection arriving and the call to s.accept(). If not properly wrapped in a try/except, this will crash the server. IMHO, the server should handle such an exception by ignoring the error, skipping the new task creation, and going back to waiting for connections.

The server crash behavior was easily demonstrated with the older uasyncio. @peterhinch has the details and fix from the prior testing and can probably correct or corroborate my concern here.

while True:
    _io_queue.queue_read(s)
    try:
        yield
    except CancelledError:
        # Shutdown server
        s.close()
        return
    s2, addr = s.accept()  # <-- WILL CRASH SERVER IF CLIENT HAS ABORTED SOCKET
    s2.setblocking(False)
    s2s = Stream(s2, {'peername': addr})
    create_task(cb(s2s, s2s))

POSSIBLY FIX WITH (UNTESTED):

while True:
    _io_queue.queue_read(s)
    try:
        yield
    except CancelledError:
        # Shutdown server
        s.close()
        return
    try:
        s2, addr = s.accept()
    except:
        continue
    else:
        s2.setblocking(False)
        s2s = Stream(s2, {'peername': addr})
        create_task(cb(s2s, s2s))

@nevercast
Contributor

@dpgeorge

In what way would you use this feature?

I would like to await a pin change as well as a timeout, concurrently, and resume when either occurs. The pin will be a data-available interrupt from external hardware. I would like to avoid implementing IRQ handlers in my code and keep everything asyncio where I can, and avoid using busy loops for timeout functionality when waiting for a state change from an IRQ.

I believe asyncio is a nice way to wrap up these problems into straightforward code.

The data is fetched over SPI on ESP32 which is currently blocking and would need to be made asyncio compatible also.

@kevinkk525
Contributor

This would also be interesting in a case where I use an Arduino connected over 1-wire, with the goal of controlling its pins and reading its ADCs. Currently, to stay compatible with the machine.Pin and machine.ADC classes, I use synchronous calls, but these block for quite some time, especially if a retransmission is needed. An awaitable Pin object would make this a lot better.

This is of course a bit different from the scenario you and Damien wrote about.

@peterhinch
Contributor

Efficient waiting on a pin could be achieved using the ioread mechanism with no changes to uasyncio. With a suitable ioctl, polling would be delegated to select.poll (which is implemented in C). I'll see if I can produce a demo of an awaitable PinChange class.

@nevercast
Contributor

nevercast commented Nov 25, 2019

ioctl polling would be delegated to select.poll

That's still polling, though, right? I'm wondering if an event-driven approach is better, since it's coming from a hardware interrupt.

@dpgeorge
Member Author

That's still polling, though, right? I'm wondering if an event-driven approach is better, since it's coming from a hardware interrupt.

Yes it's still polling, using select.poll. With a lot of events being waited on (eg many pins waiting on change, sockets for read/write, UART, async SPI, etc) it won't scale well to register everything with select.poll as it currently works.

select.poll (on bare metal ports) could be reworked so that it used interrupts (hardware events) to move objects from the poll waiting list to the poll ready list. That would require a fair amount of effort.

An alternative would be for interrupts (hardware events) to directly schedule uasyncio tasks, by creating an Event() object for each hardware event and calling event.set() when the interrupt fires.

The polling approach may seem like a better option but it's quite limiting because it only has the concept of readable/writable. Eg if one task waits for a pin to go high, and another task waits for the same pin to go low, those are 2 distinct events, and how do they map to pollability? With a duplex UART it's possible to wait for reading and writing at the same time, but for objects like pins and other things that may have more than 2 distinct events, it's hard to fit that into select.poll behaviour.

To make polling work one would probably need to create a distinct wrapper object for each event, and register each object with select.poll. Eg for a pin, one would need to provide 4 event objects (pin high, pin low, pin rising edge, pin falling edge) which all get registered with POLLIN and become "readable" when the event occurs.
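A rough sketch of one such wrapper object (assumptions: the MP_STREAM_POLL ioctl request value, the io.IOBase stream-protocol mechanism for pure-Python pollable objects, and a pyboard-style pin name):

import io
import select
from machine import Pin

_MP_STREAM_POLL = 3  # assumed value of the MP_STREAM_POLL ioctl request

class PinHighEvent(io.IOBase):
    # Hypothetical wrapper: reports POLLIN ("readable") while the pin reads high.
    def __init__(self, pin):
        self.pin = pin

    def ioctl(self, req, arg):
        if req == _MP_STREAM_POLL and (arg & select.POLLIN):
            return select.POLLIN if self.pin.value() else 0
        return 0

poller = select.poll()
poller.register(PinHighEvent(Pin('X1', Pin.IN)), select.POLLIN)
print(poller.poll(1000))  # list of (obj, event) pairs, or [] if the pin stayed low for 1s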

@nevercast
Contributor

Personally a fan of this one:

An alternative would be for interrupts (hardware events) to directly schedule uasyncio tasks, by creating an Event() object for each hardware event and calling event.set() when the interrupt fires.

@peterhinch
Contributor

After considering the issues I think the Event approach is best. An Event seems the obvious way to synchronise a coroutine with an ISR; further, enabling .set() to be called asynchronously is a worthwhile end in itself - eventually someone will do it and wonder why it's unreliable.

And use Ubuntu bionic for qemu-arm Travis CI job.

This fixes a bug in the pairing-heap implementation when nodes are deleted
with mp_pairheap_delete and then reinserted later on.

This commit adds a completely new implementation of the uasyncio module.
The aim of this version (compared to the original one in micropython-lib)
is to be more compatible with CPython's asyncio module, so that one can
more easily write code that runs under both MicroPython and CPython (and
reuse CPython asyncio libraries, follow CPython asyncio tutorials, etc).
Async code is not easy to write and any knowledge users already have from
CPython asyncio should transfer to uasyncio without effort, and vice versa.

The implementation here attempts to provide good compatibility with
CPython's asyncio while still being "micro" enough to run where MicroPython
runs. This follows the general philosophy of MicroPython itself, to make it
feel like Python.

The main change is to use a Task object for each coroutine.  This allows
more flexibility to queue tasks in various places, eg the main run loop,
tasks waiting on events, locks or other tasks.  It no longer requires
pre-allocating a fixed queue size for the main run loop.

A pairing heap is used to queue Tasks.

It's currently implemented in pure Python, separated into components with
lazy importing for optional components.  In the future parts of this
implementation can be moved to C to improve speed and reduce memory usage.
But the aim is to maintain a pure-Python version as a reference version.

All .exp files are included because they require CPython 3.8 which may not
always be available.

Includes a test where the (non uasyncio) client does a RST on the
connection, as a simple TCP server/client test where both sides are using
uasyncio, and a test for TCP stream close then write.

Implements Task and TaskQueue classes in C, using a pairing-heap data
structure.  Using this reduces RAM use of each Task, and improves overall
performance of the uasyncio scheduler.
Only included in GENERIC build.

@dpgeorge dpgeorge merged commit ad004db into micropython:master Mar 25, 2020
@dpgeorge
Member Author

MERGED!

Thanks to all for the feedback, discussion, testing, etc.

Feel free to open new issues/PRs for items discussed above (and others) that were not fixed/included in these commits.

@dpgeorge dpgeorge deleted the extmod-uasyncio branch March 25, 2020 14:51
@mk-pmb

mk-pmb commented Mar 25, 2020

Finally! Thanks all for your effort.

c0d3z3r0 pushed a commit to c0d3z3r0/micropython that referenced this pull request Apr 5, 2020
This commit adds a generator test for throwing into a nested exception, and
one when using yield-from with a pending exception cleanup.  Both these
tests currently fail on the native emitter, and are simplified versions of
native test failures from uasyncio in micropython#5332.
@dlech dlech mentioned this pull request May 15, 2021