
Speed up send() by 30% when there are no receivers. #35

Closed
wants to merge 3 commits

Conversation

@pfreixes (Author)

Use Cython, if available, to speed up the check for whether there are receivers. If Cython can't be found, a non-optimized pure-Python fallback is used.

A couple of considerations:

  1. The Cython module can be marked as a required dependency, speeding up all environments. But this is up to you.
  2. The TypeError raised when more than one sender is given as a parameter is now raised AFTER checking for the availability of receivers, which might break the contract a bit.
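For illustration, the optional-Cython pattern described above typically looks something like the sketch below. This is a rough sketch only; the module and function names (`_speedups`, `has_receivers`) are hypothetical and this is not the actual diff in this PR:

```python
try:
    # Hypothetical compiled extension built with Cython when available.
    from _speedups import has_receivers
except ImportError:
    def has_receivers(receivers):
        # Pure-Python fallback used when the Cython build is unavailable;
        # an empty dict or list is simply falsy.
        return bool(receivers)
```

Packaging-wise, making Cython required (consideration 1 above) would simply mean always building the extension instead of tolerating the ImportError.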

@jek (Contributor) commented Aug 27, 2017

I'll take a look at this. Can you share your benchmark numbers? I'm surprised that this would be faster, as IIRC I settled on the existing approach after benchmarking alternatives and finding the same length comparison being done under the hood in object.c.

@pfreixes (Author)

Yes sure, here we go

Before:

$ python ../test_blinker.py
Signals per second: 1648456.8747931719

After:

$ python test_blinker.py
Signals per second: 2216522.8374086423

The script used: https://gist.github.com/pfreixes/337ac25d606d3ed67c80d4a018fed141
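For readers who don't open the gist, a throughput benchmark in this spirit might look roughly like the sketch below (a hypothetical reconstruction, not the gist's actual contents):

```python
import time
from blinker import Signal

sig = Signal()          # no receivers connected: the case being optimized
N = 1000000

start = time.perf_counter()
for _ in range(N):
    sig.send(None)      # send() with no subscribers should be near-free
elapsed = time.perf_counter() - start

print("Signals per second:", N / elapsed)
```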

What do you think about making Cython a must (a required dependency)?

@jek (Contributor) commented Aug 28, 2017

These results didn't seem intuitive to me, so I ran a benchmark across a number of versions of Python. The results on OSX were interesting. Results are nanoseconds per send() call.

version baseline pr-35-cython % diff pr-35-fallback %diff
2.6.7 278 267 4 445 -37
2.6.8 279 263 6 419 -34
2.6.9 278 268 4 422 -34
2.7.5 277 263 5 418 -34
2.7.6 270 263 3 417 -35
2.7.7 275 263 5 418 -34
2.7.8 275 266 3 419 -34
2.7.9 271 259 5 409 -34
2.7.10 276 260 6 413 -33
2.7.11 570 415 37 693 -18
2.7.12 557 416 34 689 -19
2.7.13 565 414 37 692 -18
3.2.6 612 465 32 686 -11
3.3.6 734 574 28 740 -1
3.4.6 697 585 19 789 -12
3.5.2 768 608 26 827 -7
3.5.3 760 597 27 824 -8
3.6.0 434 391 11 528 -18
3.6.1 439 379 16 534 -18
3.6.2 426 391 9 527 -19

The 2.6 and 2.7 lines hold steady on performance with Cython not providing gains until 2.7.11, at which point this send() gets slow & Cython adds some measurable gains.

Digging further and looking at the pure-python ways of doing this check, it's clear that performance on 2.7 tanked all over the place from 2.7.11 on:

version python (send_cmp) python (send_fn_len_cmp) python (send_len_cmp) python (send_not_cmp) python (send_not_len)
2.6.7 171 319 207 166 201
2.6.8 171 309 204 167 200
2.6.9 171 308 206 167 201
2.7.5 170 314 203 167 202
2.7.6 162 307 199 160 194
2.7.7 168 313 206 167 199
2.7.8 170 316 207 168 200
2.7.9 163 301 198 161 195
2.7.10 168 308 202 164 197
2.7.11 336 598 425 329 395
2.7.12 327 591 429 326 384
2.7.13 327 596 422 327 393
3.2.6 299 544 392 301 351
3.3.6 316 573 405 320 368
3.4.6 301 577 398 297 367
3.5.2 330 613 451 329 411
3.5.3 328 609 451 327 399
3.6.0 193 420 278 193 256
3.6.1 192 419 266 195 251
3.6.2 191 401 277 188 245
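For context, the column names above suggest different ways of writing the "no receivers" guard. The bodies below are my guesses at what each variant measures (a hypothetical sketch, not the actual benchmark code):

```python
receivers = {}

def send_cmp():
    # possibly a direct equality comparison against an empty container
    if receivers == {}:
        return []

def send_fn_len_cmp():
    # length comparison routed through an extra function call
    def is_empty(r):
        return len(r) == 0
    if is_empty(receivers):
        return []

def send_len_cmp():
    # explicit length comparison
    if len(receivers) == 0:
        return []

def send_not_cmp():
    # truthiness test: empty containers are falsy
    if not receivers:
        return []

def send_not_len():
    # truthiness of the length itself
    if not len(receivers):
        return []
```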

Given these slowdowns across the board, and the generally OK numbers of 3.6 vs. < 3.6, I have to conclude that if one wants a fast Python 2.7.11+ program, it's going to require a lot of Cython everywhere. This one tiny edge is a nanosecond-sized drop in the bucket.

Curious to hear what version of Python you're running and general results you've found.

@jek (Contributor) commented Aug 29, 2017

I ran the same suite on Linux: no performance drop in the 2.7 line. This time the Cython variant had a negligible effect.

version baseline (send) pr-35-cython (send) % diff pr-35-fallback (send) %diff
2.6.6 289.2 273.8 6 425.16 -32
2.6.7 286.36 282.56 1 423.02 -32
2.6.8 283.83 271.48 5 414.87 -32
2.6.9 282.44 277.25 2 420.71 -33
2.7.5 276.96 285.97 -3 404.19 -31
2.7.6 274.51 273.78 0 404.23 -32
2.7.7 277.66 270.63 3 414.22 -33
2.7.8 285.06 269.68 6 417.76 -32
2.7.9 282.26 286.07 -1 427.24 -34
2.7.10 273.61 277.28 -1 407.47 -33
2.7.11 276.42 243.52 14 403.76 -32
2.7.12 270.31 240.1 13 409.79 -34
2.7.13 269.08 244.43 10 408.44 -34
3.2.6 381.24 331.42 15 453.78 -16
3.3.6 447.74 378.87 18 520.25 -14
3.4.6 456.26 391.8 16 532.15 -14
3.5.2 460.66 420.34 10 538.7 -14
3.5.3 467.09 415.74 12 544.64 -14
3.6.0 416.66 385.77 8 524.18 -21
3.6.1 418.48 393.61 6 518.22 -19
3.6.2 415.21 389.52 7 521.92 -20

@pfreixes (Author)

Hi @jek, that was a good test!

Regarding the difference between the baseline and the fallback: my fault, I had run the test comparing only the Cython version vs. the fallback. Looking at your results, it seems the fallback performs worse than the baseline code.

I've changed the receiver-existence check a bit, switching to the fastest variant from your benchmarks, using not rather than len. The results are a bit better for both the Cython version and the fallback.

Baseline Cython Fallback
400.51 356.73 427.17

(*) tests executed using Python 3.6.2

IMHO the subclassing introduced is adding some extra overhead. Modifying the master branch so that the receiver check is the first thing send() does gives 310.06.

To sum up:

  1. Something happened with Python 2.7 versions prior to 2.7.11 and how Cython compiles the code. If I have time I'd like to investigate; in any case, IMHO this is not a blocker.
  2. The subclassing is not a good strategy. The fallback is slower, and the Cython version could be faster than it is now.

Is there any chance to consider the following points?

  1. Move the base implementation to Cython.
  2. Move the check for the existence of receivers before the check of the positional arguments.

Thoughts?

@jek (Contributor) commented Aug 29, 2017

I have no objection to moving the receiver check up, so long as the signature can still be checked. A programming error on send() is a dangerous and expensive time-bomb. Much tougher to fix errors in the emitter (usually someone else's software) than the receiver (usually your own software).

The diff at 873d165 moves the receiver check up and throws a bone to nanosecond shaving by omitting the signature check under -O. (Linux timings below.) I'm generally against complexity for micro-speedups, but I am committed to making unsubscribed signals as close to free as possible, and these changes aren't doing any crazy acrobatics.
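My reading of what such a reordering could look like is sketched below. This is not the actual 873d165 diff, just an illustration of checking receivers first and gating the sender-count check on __debug__ so that python -O skips it:

```python
def send(self, *sender, **kwargs):
    # Fast path first: with nobody subscribed, return before any other work.
    if not self.receivers:
        if __debug__:
            # The signature is still validated on normal runs; this block is
            # stripped under python -O.
            if len(sender) > 1:
                raise TypeError('send() accepts only one positional '
                                'argument, %s given' % len(sender))
        return []

    # Normal path: validate the sender and notify receivers.
    if len(sender) == 0:
        sender = None
    elif len(sender) > 1:
        raise TypeError('send() accepts only one positional argument, '
                        '%s given' % len(sender))
    else:
        sender = sender[0]
    return [(receiver, receiver(sender, **kwargs))
            for receiver in self.receivers_for(sender)]
```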

I've thought about a cython version, but I've found the signals to be fast enough in production, and the general silence from the community on performance over many years also has me leaving well enough alone.

If you do attempt a cython version for your requirements, I am interested to follow the development. I would offer the advice to start the port by implementing the Signal API contract rather than following the code internals. The data structures and algorithms in pure-Python blinker are painstakingly chosen to carefully exploit atomic operations in (C)Python for thread safety. So much so that this was the genesis of the library: I ripped it out of another piece of software and shared it because it was so difficult to get the approach right across interpreters. A native cython version puts the GIL in your control and could be quite different internally.

version baseline (send) noop-opt (send) noop-std (send)
2.6.6 266.59 249.82 266.07
2.6.7 265.41 251.44 265.66
2.6.8 263.66 245.0 263.36
2.6.9 261.46 243.38 264.21
2.7.5 256.85 242.93 255.33
2.7.6 252.12 245.5 255.19
2.7.7 262.06 235.49 248.51
2.7.8 255.5 239.17 258.69
2.7.9 258.04 237.43 253.62
2.7.10 248.59 232.54 247.48
2.7.11 249.93 237.34 251.11
2.7.12 245.22 229.9 241.88
2.7.13 246.97 232.42 248.19
3.2.6 343.46 332.29 342.45
3.3.6 410.62 396.69 401.67
3.4.6 428.06 386.11 418.32
3.5.2 464.8 420.22 431.01
3.5.3 454.43 413.25 468.36
3.6.0 370.61 357.68 379.96
3.6.1 393.52 353.94 377.64
3.6.2 376.03 359.36 363.99

@pfreixes (Author)

Regarding the current status and how it has performed until now: IMHO, in recent years the Python community and the software being built around it have taken on another perspective: performance.

There are many examples, such as uvloop [1], Sanic [2], and so on. The requirements have changed, and the need for a signal system ready to cope with high call throughput is IMHO a reality. I came up with this concept a few weeks ago while trying to implement a function in the default asyncio event loop that gathers the load of the system [3]. Among the discussions around that proposal, somebody argued that this kind of system should be done using instrumentation, as Trio already does [4]. I tried that approach, and as a result the performance degraded by 10% just for free.

Investigating the main cause of the degradation, I realized that the usual case, no listeners attached, was not optimized at all and was adding extra overhead that might not be necessary. My conclusion was pretty clear: the overhead of the signal system has to be negligible when there are no receivers.

I appreciate the data you have posted about the current performance. I will take all of your advice into consideration and will try to work on a new iteration aimed at

making unsubscribed signals as close to free as possible

[1] https://github.com/MagicStack/uvloop/tree/master/uvloop
[2] https://github.com/channelcat/sanic
[3] https://mail.python.org/pipermail/async-sig/2017-August/000382.html
[4] https://trio.readthedocs.io/en/latest/reference-core.html#instrument-api

@jek (Contributor) commented Aug 30, 2017

To be clear, send is well optimized for pure Python. A function call, a couple opcodes and a list allocation is the minimum cost without resorting to extremes like call stack inspection to determine if a return value is used or not.

Low level tight loop instrumentation via blinker signals is probably overkill. Instrumentation code can be very simple and fast. Instrumentation event registration is largely static so there's little need for connect/disconnect protection during signal iteration. No return values. No need for subscribing to specific senders at a framework level. Even trio's instrumentation carries a performance penalty through the 'interface' pattern: a listener on one event slows down all events.

@jek closed this Aug 30, 2017
@jek (Contributor) commented Aug 30, 2017

I personally would use something like the below (and do, for my own projects). If you're instrumenting such a tight loop that a difference of ~60ns is causing a 10% slowdown, I think you could look at inline events. (A single function call in CPython is ~100ns, so I figure this must be a very tight loop or an unrelated programming error.)

https://gist.github.com/jek/4a5713b5968524e5d072d36b9ad99a6b

fire_inline: 226.13 ns/call
fire_dry: 333.29 ns/call
fire_signal: 364.83 ns/call
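For readers who don't open the gist, the "inline events" idea might look roughly like the sketch below. fire_inline and fire_signal are the gist's names, but these bodies are my own guesses rather than the gist's code (fire_dry is omitted here; it presumably measures dispatch with no listeners attached):

```python
from blinker import Signal

# Registration is effectively static: a plain list of callables.
listeners = []
sig = Signal()

def fire_inline(payload):
    # Inline event: a bare truthiness check and direct calls in the hot path.
    if listeners:
        for listener in listeners:
            listener(payload)

def fire_signal(payload):
    # The same notification routed through a blinker signal.
    sig.send(None, payload=payload)
```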

@pfreixes (Author)

Regarding

Low level tight loop instrumentation via blinker signals is probably overkill

Absolutely. I was just trying to apply some ideas to other projects like blinker and django-signals that might be improved, because they solve the same problem and have the same issues. I never intended to use Blinker to instrument the loop.

Regarding the proposal to use "inline" signals: it seems pretty straightforward and might suit the case of the tight loop. Indeed, I gave it a try and modified the asyncio loop to simulate an inline signal [1]; the performance degradation is almost negligible. I think it might work, having just a signal/callback and gathering the proper statistics at each iteration.

Thanks for your time.

[1] pfreixes/cpython@83dbe66

@github-actions bot locked as resolved and limited conversation to collaborators Jun 30, 2023