Speed up send() by 30% when there are no receivers. #35
Use Cython, if available, to speed up the check for receivers; if Cython can't be found, a non-optimized version is used.
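An optional-extension fallback of the kind described here is usually written as a guarded import. The following is only a minimal sketch: the module name `_fast_receivers` and the function `has_receivers` are hypothetical and are not taken from the PR or from blinker.

```python
# Hypothetical names throughout; this is not blinker's actual layout.
try:
    # Compiled extension, importable only if the Cython build step ran.
    from _fast_receivers import has_receivers  # hypothetical module
    USING_CYTHON = True
except ImportError:
    USING_CYTHON = False

    def has_receivers(receivers):
        """Pure-Python fallback: True if at least one receiver is connected."""
        return bool(receivers)
```

With this shape, callers always use `has_receivers(...)` and never need to know which implementation was loaded.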
I'll take a look at this. Can you share your benchmark numbers? I'm surprised that this would be faster: IIRC I settled on the existing approach after benchmarking alternatives and finding that the same length comparison is done under the hood in object.c.
Yes, sure, here we go.

Before:
$ python ../test_blinker.py
Signals per second: 1648456.8747931719

After:
$ python test_blinker.py
Signals per second: 2216522.8374086423

The script used: https://gist.github.com/pfreixes/337ac25d606d3ed67c80d4a018fed141

What do you think about the use of Cython as a must?
These results didn't seem intuitive to me, so I ran a benchmark across a number of Python versions. The results on OSX were interesting. Results are in nanoseconds per send() call.
The 2.6 and 2.7 lines hold steady on performance, with Cython not providing gains until 2.7.11, at which point this

Digging further and looking at the pure-Python ways of doing this check, it's clear that performance on 2.7 tanked all over the place from 2.7.11 on:
From these slowdowns across the board, and the generally OK numbers of 3.6 vs. < 3.6, I have to conclude that if one wants a fast Python 2.7.11+ program, it's going to require a lot of Cython across the board. This one tiny edge is a nanoseconds drop in the bucket.

Curious to hear what version of Python you're running and general results you've found.
I ran the same suite on Linux: no performance drop in the 2.7 line. This time the Cython variant had a negligible effect.
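For readers who want to reproduce this kind of per-call measurement, here is a rough sketch using the standard `timeit` module. It is not the suite used above: the three candidate expressions are assumptions about what "pure-Python ways of doing this check" might cover, and the code targets Python 3.6+.

```python
import timeit


def ns_per_call(stmt, setup, number=200_000, repeat=5):
    """Best-of-`repeat` cost of one execution of `stmt`, in nanoseconds."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e9


# An empty dict stands in for a signal with no receivers (assumption).
SETUP = "receivers = {}"

# Candidate emptiness checks; chosen for illustration only.
CANDIDATES = {
    "truthiness": "not receivers",
    "len() == 0": "len(receivers) == 0",
    "bool()": "not bool(receivers)",
}


def run():
    """Return a mapping of candidate name to nanoseconds per check."""
    return {name: ns_per_call(stmt, SETUP) for name, stmt in CANDIDATES.items()}


if __name__ == "__main__":
    for name, ns in run().items():
        print(f"{name:12s} {ns:8.1f} ns/check")
```

Taking the minimum over several repeats reduces noise from other processes; absolute numbers will still vary by interpreter version and platform, as the results in this thread show.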
Hi @jek, that was a good test! Regarding the difference between the baseline and the fallback: my fault, I ran the test comparing only the Cython version against the fallback. Looking at your results, it looks like the fallback performs worse than the baseline code. I've slightly changed the function that checks for receivers to use the fastest variant from your benchmarks, using
(*) Tests executed using Python 3.6.2.

IMHO the subclassing that was introduced adds some extra overhead. Modifying the master branch so that the receiver check is the first thing done gives

To sum up:
Is there any chance to consider the following points:
Thoughts?
I have no objection to moving the receiver check up, so long as the signature can still be checked. A programming error on

The diff at 873d165 moves the receiver check up and throws a bone to nanosecond shaving by omitting the signature check under

I've thought about a Cython version, but I've found the signals to be fast enough in production, and the general silence from the community on performance over many years also has me leaving well enough alone.

If you do attempt a Cython version for your requirements, I am interested to follow the development. I would offer the advice to start the port by implementing the Signal API contract rather than following the code internals. The data structures and algorithms in pure-Python blinker are painstakingly chosen to carefully exploit atomic operations in (C)Python for thread safety. So much so that this was the genesis of the library: I ripped it out of another piece of software and shared it because it was so difficult to get the approach right across interpreters. A native Cython version puts the GIL in your control and could be quite different internally.
Regarding the current status and how it has performed until now: IMHO, over the last few years the Python community and the software being built around it have gained another perspective: performance. There are many examples, such as uvloop [1], sanic [2], and so on. The requirements have changed, and the need for a signal system able to cope with high call throughput is IMHO a reality.

I came up with this idea a few weeks ago while trying to implement a function for the default asyncio loop that gathers the load of the system [3]. Among the discussions that came up around the proposal, somebody argued that this kind of system should be built using instrumentation, as Trio already does [4]. I tried that approach, and as a result performance degraded by 10% for nothing. Investigating the main cause of the degradation, I realized that the usual case, no listeners attached, was not optimized at all and was adding overhead that might not be necessary. My conclusion was pretty clear: the overhead of the signal system has to be negligible when there are no receivers.

I appreciate the data that you have posted about the current performance. I will take all of your advice into consideration and will try to work on a new iteration, trying to
[1] https://github.com/MagicStack/uvloop/tree/master/uvloop
To be clear,

Low-level, tight-loop instrumentation via blinker signals is probably overkill. Instrumentation code can be very simple and fast. Instrumentation event registration is largely static, so there's little need for connect/disconnect protection during signal iteration. No return values. No need for subscribing to specific senders at a framework level. Even Trio's instrumentation carries a performance penalty through the 'interface' pattern: a listener on one event slows down all events.
I personally would use something like the below (and do, for my own projects). If you're instrumenting such a tight loop that a difference of ~60ns is causing a 10% slowdown, I think you could look at inline events. (A single function call in CPython is ~100ns, so I figure this must be a very tight loop or an unrelated programming error.)

https://gist.github.com/jek/4a5713b5968524e5d072d36b9ad99a6b

fire_inline: 226.13 ns/call
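One way to read "inline events" is a plain callback slot that the loop tests directly, so the unobserved case costs a single comparison instead of a function call. This is a sketch of that idea only; it is not the code from the gist above, and the names `_on_iteration`, `set_iteration_hook` and `run_loop` are made up for illustration.

```python
# Hypothetical hook slot; None means "nobody is listening".
_on_iteration = None


def set_iteration_hook(callback):
    """Install (or clear, with None) the single inline listener."""
    global _on_iteration
    _on_iteration = callback


def run_loop(iterations):
    """Stand-in for a tight loop that should cost almost nothing when unobserved."""
    total = 0
    for i in range(iterations):
        total += i
        hook = _on_iteration      # one global lookup per iteration
        if hook is not None:      # cheap identity test, no call when unset
            hook(i)               # no return value is collected
    return total
```

A single slot trades away multiple receivers, sender filtering and return values, which matches the point above that instrumentation rarely needs them.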
Regarding
Absolutely, I was just trying to apply some ideas to other projects like blinker and django-signals that might be improved, because they solve the same problem and have the same issues. I never intended to use Blinker to instrument the loop. The proposal to use "inline" signals seems pretty straightforward and might suit the case of the tight loop. Indeed, I gave it a try and modified the asyncio loop to simulate an inline signal [1]; the performance degradation is almost negligible. I think it might work, having just a signal/callback and gathering the proper statistics at each iteration. Thanks for your time.
Use Cython, if available, to speed up the check for receivers. If Cython can't be found, a non-optimized version is used.
A couple of considerations:
The TypeError raised when more than one sender is given as a parameter is now raised AFTER checking the availability of receivers, which might break the contract a bit.
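The contract concern can be made concrete with a toy class. This is a sketch under stated assumptions: `ToySignal` and both method names are hypothetical, and the code does not reproduce blinker's Signal internals or the diff discussed in this thread.

```python
class ToySignal:
    """Illustrative only; not blinker's Signal implementation."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)
        return receiver

    def send_checked_first(self, *sender, **kwargs):
        # Signature check first: misuse always raises, even with no receivers.
        if len(sender) > 1:
            raise TypeError("send() accepts only one positional argument")
        if not self.receivers:
            return []
        return self._dispatch(sender, kwargs)

    def send_receivers_first(self, *sender, **kwargs):
        # Fast path first: with no receivers, misuse goes unnoticed.
        if not self.receivers:
            return []
        if len(sender) > 1:
            raise TypeError("send() accepts only one positional argument")
        return self._dispatch(sender, kwargs)

    def _dispatch(self, sender, kwargs):
        who = sender[0] if sender else None
        return [(receiver, receiver(who, **kwargs)) for receiver in self.receivers]
```

With no receivers connected, `send_receivers_first("a", "b")` quietly returns an empty list while `send_checked_first("a", "b")` raises `TypeError`; once a receiver is connected, both raise.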