
Uasyncio task profiling #7456

Open · hoihu opened this issue Jun 24, 2021 · 6 comments

hoihu commented Jun 24, 2021

This issue discusses a possible profiling solution for uasyncio tasks.

See #7399 for the related discussion (specifically #7399 (comment)).

Our main motivation is to measure when and for how long a task is active on an embedded device. This is important for understanding complex interactions when several hardware interfaces are in use. It also helps to understand the overall system load and which tasks are wasting processor power.

Ideally the uasyncio loop would provide a callback for this, which has no performance impact when not used.

We implemented the functionality by tracking the send calls in a wrapper around the coroutine, in order to avoid patching the complex loop function. We understand that this adds overhead to each task on every enter/exit:

class TrackedCoro:
    def __init__(self, coro):
        self._coro = coro

    def send(self, value):
        signal(ENTER, ...)  # send profiling packet on enter
        try:
            # resume the wrapped coroutine until its next suspension point
            return self._coro.send(value)
        finally:
            signal(EXIT, ...)  # send profiling packet on exit

    def throw(self, exc):
        # needed so uasyncio can cancel or raise into the wrapped coroutine
        return self._coro.throw(exc)

The wrapper sends a small profiler frame over UART/USB, consisting of a few bytes with a timestamp and some basic information about the coro. An external program can then visualize the individual tasks, similar to a logic analyser.
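For illustration, a minimal sketch of what such a frame and the signal() call could look like (the field layout, event codes and UART settings here are assumptions, not our actual protocol):

import struct
import time
from machine import UART
from micropython import const

uart = UART(1, baudrate=921600)  # hypothetical profiling UART

ENTER = const(0)  # hypothetical event codes
EXIT = const(1)

def signal(event, task_id):
    # 1 byte event + 1 byte task ID + 4 bytes timestamp (us) = 6-byte frame
    uart.write(struct.pack("<BBI", event, task_id, time.ticks_us()))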

Here is an example of how it looks:

[screenshot: scope-style trace of individual task activity]

Overall this has been really useful for gaining insight into the asyncio tasks.

So that's the intention of the request to access the coro. Maybe other approaches exist to implement this in a similar, more efficient way? Other than that, we are building our image with the patch linked above, but would of course like to have as few differences from upstream micropython as possible.

stephanelsmith commented Aug 20, 2021

This is awesome. I've been looking/thinking about something like this. From the looks of it, I have a system similar to yours, with a number of peripherals and interlinked tasks.

I'm currently looking into whether we can squeeze machine.idle() or sleep into some of the downtime.
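In its simplest form that could be an idle task along these lines (just a sketch; whether machine.idle() actually saves anything here is exactly the open question):

import machine
import uasyncio

async def idler():
    # hypothetical lowest-priority task: hint the MCU to gate its clock
    # until the next interrupt, then yield straight back to the scheduler
    while True:
        machine.idle()
        await uasyncio.sleep_ms(0)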

What software did you use here? Love it. I'd like to get something similar going on my end; it seems essential.

peterhinch commented

Here is my take on this.

hoihu commented Sep 22, 2021

What software did you use here? Love it. I'd like to get something similar going on my end; it seems essential.

It's a wxPython-based application. The scope view above is a wx widget that was developed in-house. The tool also includes other views, such as a statistics view with min/max/median times of the tasks plotted above.

I'd like to get something similar going on my end; it seems essential.

It's definitely a time-consuming investment, but it helps a lot while debugging! If your application is complex, I'd very much recommend it.

hoihu commented Sep 22, 2021

Here is my take on this.

Thanks a lot, Peter, for the update.

I understand you went the route of requiring decorators for certain coros, with pin toggling on enter/exit? One advantage of this is that you can use an existing visualization backend, such as the Saleae GUI. Pin toggling is also faster and might help when you need to debug really fast events.
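For concreteness, a decorator along these lines would mark a task's lifetime on a pin (the pin number and helper name are made up; peterhinch's actual code may well toggle per scheduling slice instead, which would need a send() wrapper like TrackedCoro above):

from machine import Pin
import uasyncio

def monitor(pin_no):
    pin = Pin(pin_no, Pin.OUT, value=0)
    def decorator(coro_fn):
        async def wrapped(*args, **kwargs):
            pin(1)  # task entered
            try:
                return await coro_fn(*args, **kwargs)
            finally:
                pin(0)  # task exited (or was cancelled)
        return wrapped
    return decorator

@monitor(16)  # hypothetical pin number
async def sensor_task():
    while True:
        await uasyncio.sleep_ms(100)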

It is different from our approach, but I guess it's a matter of what you want to debug...

We basically wrapped "create_task" of uasyncio and assigned each task a name with an ID that is sent on enter/exit (together with the timestamp). This way you only need one UART to transmit all the debugging information, and all tasks are then automatically traced without needing a decorator. Since it's a UART connection you can also use one of the USB VCP ports for this and get a debugging port for free, without attaching a logic analyser ;)
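As a rough sketch, the wrapping could look something like this (the ID allocation and the extended TrackedCoro signature are assumptions; the scheduler only needs the wrapper to expose send() and throw()):

import uasyncio

_next_id = 0
_orig_create_task = uasyncio.create_task

def create_task(coro, name=None):
    global _next_id
    _next_id += 1
    # wrap the coro so every resume/suspend emits an enter/exit frame
    return _orig_create_task(TrackedCoro(coro, _next_id, name))

uasyncio.create_task = create_task  # every task created afterwards is traced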

But you need a separate GUI that understands the UART frames and displays separate channels based on the ID. And the UART transmission also adds time lag/overhead.

peterhinch commented Sep 23, 2021

Your GUI back end is very impressive but beyond my skills to create. By contrast, the code on the Pico is absurdly simple but does rely on the user having a scope or logic analyser.

I guess it's a matter of what you want to debug...

Absolutely. One aim is to enable viewing hardware state changes along with program state changes. The latency of a Pico pin change after a task starts is about 100μs.

Thank you for the insight into the design of your solution. Monitoring every coro wouldn't fit easily with my approach owing to physical limits on the number of channels on the scope or LA, or (ultimately) pins on the Pico. I judged it easier to move decorators around than to juggle wires physically, although my approach allows for this up to the limit of Pico pins.

How do you handle multiple instances of the same task? This can occur inadvertently, and detecting and dealing with this case seemed useful.

Do you have a facility to detect occasional CPU hogging? This was one of my key aims.

stephanelsmith commented Sep 24, 2021

I have neither an amazing software logic analyzer nor a physical one. However, I think the exercise is still worth it. My motivation is to understand the margin in my async system. It's related to process hogging, but different: detecting a process hog is very useful, especially for identifying timing-related bugs and specific optimizations for problem tasks, while margin is more of a generic utilization figure. Maybe I really just need a simple number.

In this section of asyncio.core:

t = _task_queue.peek()
if t:
    # A task waiting on _task_queue; "ph_key" is time to schedule task at
    dt = max(0, ticks_diff(t.ph_key, ticks()))
elif not _io_queue.map:
    # No tasks can be woken so finished running
    return
# print('(poll {})'.format(dt), len(_io_queue.map))
_io_queue.wait_io_event(dt)

dt, I think, is really what I want to try to track: maybe summed idle time divided by a utilization epoch.
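A minimal sketch of that measurement, assuming the run loop is patched to accumulate the time actually spent inside wait_io_event (idle_ms and report_load are made-up names for illustration):

import time
import uasyncio

idle_ms = 0  # accumulated time the scheduler spent waiting for I/O

# hypothetical patch around the wait_io_event(dt) call in the run loop:
#     t0 = time.ticks_ms()
#     _io_queue.wait_io_event(dt)
#     idle_ms += time.ticks_diff(time.ticks_ms(), t0)

async def report_load(epoch_ms=1000):
    global idle_ms
    while True:
        idle_ms = 0
        t0 = time.ticks_ms()
        await uasyncio.sleep_ms(epoch_ms)
        elapsed = time.ticks_diff(time.ticks_ms(), t0)
        print("utilization:", 100 - 100 * idle_ms // elapsed, "%")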

The goals being: 1) how much more can I pack into my application, and 2) are there any low-power options during idle? Item 2 is the most important, but I have few ideas on the ESP platform. Mostly throttle down the CPU frequency if I have good margin? It would be better if I could run as fast as possible and then idle on demand, as on other embedded platforms. I was optimistic about machine.idle, but it doesn't seem to actually do much of anything in my tests.
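On an ESP32, the frequency-throttling part could be as simple as this sketch (the 50% threshold is arbitrary):

import machine

def throttle(load):
    # hypothetical policy: drop the clock when measured utilization is low
    machine.freq(80000000 if load < 0.5 else 240000000)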
