
Speed up send() by 30% when there are no receivers. #35

Closed
wants to merge 3 commits

Conversation

@pfreixes (Author)

Use Cython, if available, to speed up the check for whether there are receivers. If Cython can't be found, a non-optimized pure-Python fallback is used.

A couple of considerations:

  1. The Cython module can be marked as a required dependency, speeding up all environments. But this is up to you.
  2. The TypeError raised when more than one sender is given as a parameter is now raised AFTER checking for the availability of receivers, which might break the contract a bit.
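For illustration, the optional-Cython pattern described above typically looks something like the sketch below. This is a rough sketch only; the module and function names (`_speedups`, `has_receivers`) are hypothetical and this is not the actual diff in this PR:

```python
try:
    # Hypothetical compiled extension built with Cython when available.
    from _speedups import has_receivers
except ImportError:
    def has_receivers(receivers):
        # Pure-Python fallback used when the Cython build is unavailable;
        # an empty dict or list is simply falsy.
        return bool(receivers)
```

Packaging-wise, making Cython required (consideration 1 above) would simply mean always building the extension instead of tolerating the ImportError.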

@jek (Contributor) commented Aug 27, 2017

I'll take a look at this. Can you share your benchmark numbers? I'm surprised that this would be faster, as IIRC I settled on the existing approach after benchmarking alternatives and finding the same length comparison being done under the hood in object.c.

@pfreixes (Author)

Yes sure, here we go

Before:

$ python ../test_blinker.py
Signals per second: 1648456.8747931719

After:

$ python test_blinker.py
Signals per second: 2216522.8374086423

The script used: https://gist.github.com/pfreixes/337ac25d606d3ed67c80d4a018fed141
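For readers who don't open the gist, a throughput benchmark in this spirit might look roughly like the sketch below (a hypothetical reconstruction, not the gist's actual contents):

```python
import time
from blinker import Signal

sig = Signal()          # no receivers connected: the case being optimized
N = 1000000

start = time.perf_counter()
for _ in range(N):
    sig.send(None)      # send() with no subscribers should be near-free
elapsed = time.perf_counter() - start

print("Signals per second:", N / elapsed)
```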

What do you think about making Cython a must (a required dependency)?

@jek (Contributor) commented Aug 28, 2017

These results didn't seem intuitive to me, so I ran a benchmark across a number of versions of Python. The results on OSX were interesting. Results are nanoseconds per send() call.

version baseline pr-35-cython % diff pr-35-fallback %diff
2.6.7 278 267 4 445 -37
2.6.8 279 263 6 419 -34
2.6.9 278 268 4 422 -34
2.7.5 277 263 5 418 -34
2.7.6 270 263 3 417 -35
2.7.7 275 263 5 418 -34
2.7.8 275 266 3 419 -34
2.7.9 271 259 5 409 -34
2.7.10 276 260 6 413 -33
2.7.11 570 415 37 693 -18
2.7.12 557 416 34 689 -19
2.7.13 565 414 37 692 -18
3.2.6 612 465 32 686 -11
3.3.6 734 574 28 740 -1
3.4.6 697 585 19 789 -12
3.5.2 768 608 26 827 -7
3.5.3 760 597 27 824 -8
3.6.0 434 391 11 528 -18
3.6.1 439 379 16 534 -18
3.6.2 426 391 9 527 -19

The 2.6 and 2.7 lines hold steady on performance with Cython not providing gains until 2.7.11, at which point this send() gets slow & Cython adds some measurable gains.

Digging further and looking at the pure-python ways of doing this check, it's clear that performance on 2.7 tanked all over the place from 2.7.11 on:

version python (send_cmp) python (send_fn_len_cmp) python (send_len_cmp) python (send_not_cmp) python (send_not_len)
2.6.7 171 319 207 166 201
2.6.8 171 309 204 167 200
2.6.9 171 308 206 167 201
2.7.5 170 314 203 167 202
2.7.6 162 307 199 160 194
2.7.7 168 313 206 167 199
2.7.8 170 316 207 168 200
2.7.9 163 301 198 161 195
2.7.10 168 308 202 164 197
2.7.11 336 598 425 329 395
2.7.12 327 591 429 326 384
2.7.13 327 596 422 327 393
3.2.6 299 544 392 301 351
3.3.6 316 573 405 320 368
3.4.6 301 577 398 297 367
3.5.2 330 613 451 329 411
3.5.3 328 609 451 327 399
3.6.0 193 420 278 193 256
3.6.1 192 419 266 195 251
3.6.2 191 401 277 188 245
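For context, the column names above suggest different ways of writing the "no receivers" guard. The bodies below are my guesses at what each variant measures (a hypothetical sketch, not the actual benchmark code):

```python
receivers = {}

def send_cmp():
    # possibly a direct equality comparison against an empty container
    if receivers == {}:
        return []

def send_fn_len_cmp():
    # length comparison routed through an extra function call
    def is_empty(r):
        return len(r) == 0
    if is_empty(receivers):
        return []

def send_len_cmp():
    # explicit length comparison
    if len(receivers) == 0:
        return []

def send_not_cmp():
    # truthiness test: empty containers are falsy
    if not receivers:
        return []

def send_not_len():
    # truthiness of the length itself
    if not len(receivers):
        return []
```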

Given these slowdowns across the board, and the generally OK numbers of 3.6 vs. < 3.6, I have to conclude that if one wants a fast Python 2.7.11+ program, it's going to require a lot of Cython everywhere. This one tiny edge is a nanosecond-sized drop in the bucket.

Curious to hear what version of Python you're running and general results you've found.

@jek (Contributor) commented Aug 29, 2017

I ran the same suite on Linux: no performance drop in the 2.7 line. This time the Cython variant had a negligible effect.

version baseline (send) pr-35-cython (send) % diff pr-35-fallback (send) %diff
2.6.6 289.2 273.8 6 425.16 -32
2.6.7 286.36 282.56 1 423.02 -32
2.6.8 283.83 271.48 5 414.87 -32
2.6.9 282.44 277.25 2 420.71 -33
2.7.5 276.96 285.97 -3 404.19 -31
2.7.6 274.51 273.78 0 404.23 -32
2.7.7 277.66 270.63 3 414.22 -33
2.7.8 285.06 269.68 6 417.76 -32
2.7.9 282.26 286.07 -1 427.24 -34
2.7.10 273.61 277.28 -1 407.47 -33
2.7.11 276.42 243.52 14 403.76 -32
2.7.12 270.31 240.1 13 409.79 -34
2.7.13 269.08 244.43 10 408.44 -34
3.2.6 381.24 331.42 15 453.78 -16
3.3.6 447.74 378.87 18 520.25 -14
3.4.6 456.26 391.8 16 532.15 -14
3.5.2 460.66 420.34 10 538.7 -14
3.5.3 467.09 415.74 12 544.64 -14
3.6.0 416.66 385.77 8 524.18 -21
3.6.1 418.48 393.61 6 518.22 -19
3.6.2 415.21 389.52 7 521.92 -20

@pfreixes (Author)

Hi @jek, that was a good test!

Regarding the difference between the baseline and the fallback: my fault, I had run the test comparing only the Cython version vs. the fallback. Looking at your results, it seems the fallback performs worse than the baseline code.

I've changed the receiver-existence check a bit, switching to the fastest variant from your benchmarks, using not rather than len. The results are a bit better for both the Cython version and the fallback.

Baseline Cython Fallback
400.51 356.73 427.17

(*) tests executed using Python 3.6.2

IMHO the subclassing introduced is adding some extra overhead. Modifying the master branch so that the receiver check is the first thing send() does gives 310.06.

To sum up:

  1. Something happened with Python 2.7 versions prior to 2.7.11 and how Cython compiles the code. If I have time I'd like to investigate; in any case, IMHO this is not a blocker.
  2. The subclassing is not a good strategy. The fallback is slower, and the Cython version could be faster than it is now.

Is there any chance to consider the following points?

  1. Move the base implementation to Cython.
  2. Move the check for the existence of receivers before the check of the positional arguments.

Thoughts?

@jek (Contributor) commented Aug 29, 2017

I have no objection to moving the receiver check up, so long as the signature can still be checked. A programming error on send() is a dangerous and expensive time-bomb. Much tougher to fix errors in the emitter (usually someone else's software) than the receiver (usually your own software).

The diff at 873d165 moves the receiver check up and throws a bone to nanosecond shaving by omitting the signature check under -O. (Linux timings below.) I'm generally against complexity for micro-speedups, but I am committed to making unsubscribed signals as close to free as possible, and these changes aren't doing any crazy acrobatics.
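My reading of what such a reordering could look like is sketched below. This is not the actual 873d165 diff, just an illustration of checking receivers first and gating the sender-count check on __debug__ so that python -O skips it:

```python
def send(self, *sender, **kwargs):
    # Fast path first: with nobody subscribed, return before any other work.
    if not self.receivers:
        if __debug__:
            # The signature is still validated on normal runs; this block is
            # stripped under python -O.
            if len(sender) > 1:
                raise TypeError('send() accepts only one positional '
                                'argument, %s given' % len(sender))
        return []

    # Normal path: validate the sender and notify receivers.
    if len(sender) == 0:
        sender = None
    elif len(sender) > 1:
        raise TypeError('send() accepts only one positional argument, '
                        '%s given' % len(sender))
    else:
        sender = sender[0]
    return [(receiver, receiver(sender, **kwargs))
            for receiver in self.receivers_for(sender)]
```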

I've thought about a cython version, but I've found the signals to be fast enough in production, and the general silence from the community on performance over many years also has me leaving well enough alone.

If you do attempt a cython version for your requirements, I am interested to follow the development. I would offer the advice to start the port by implementing the Signal API contract rather than following the code internals. The data structures and algorithms in pure-Python blinker are painstakingly chosen to carefully exploit atomic operations in (C)Python for thread safety. So much so that this was the genesis of the library: I ripped it out of another piece of software and shared it because it was so difficult to get the approach right across interpreters. A native cython version puts the GIL in your control and could be quite different internally.

version baseline (send) noop-opt (send) noop-std (send)
2.6.6 266.59 249.82 266.07
2.6.7 265.41 251.44 265.66
2.6.8 263.66 245.0 263.36
2.6.9 261.46 243.38 264.21
2.7.5 256.85 242.93 255.33
2.7.6 252.12 245.5 255.19
2.7.7 262.06 235.49 248.51
2.7.8 255.5 239.17 258.69
2.7.9 258.04 237.43 253.62
2.7.10 248.59 232.54 247.48
2.7.11 249.93 237.34 251.11
2.7.12 245.22 229.9 241.88
2.7.13 246.97 232.42 248.19
3.2.6 343.46 332.29 342.45
3.3.6 410.62 396.69 401.67
3.4.6 428.06 386.11 418.32
3.5.2 464.8 420.22 431.01
3.5.3 454.43 413.25 468.36
3.6.0 370.61 357.68 379.96
3.6.1 393.52 353.94 377.64
3.6.2 376.03 359.36 363.99

@pfreixes (Author)

Regarding the current status and how it has performed until now: IMHO, in recent years the Python community and the software being built around it have taken on another perspective: performance.

There are many examples, such as uvloop [1], Sanic [2], and so on. The requirements have changed, and the need for a signal system ready to cope with high call throughput is IMHO a reality. I came up with this concept a few weeks ago while trying to implement a function in the default asyncio event loop that gathers the load of the system [3]. Among the discussions around that proposal, somebody argued that this kind of system should be done using instrumentation, as Trio already does [4]. I tried that approach, and as a result the performance degraded by 10% just for free.

Investigating the main cause of the degradation, I realized that the usual case, no listeners attached, was not optimized at all and was adding extra overhead that might not be necessary. My conclusion was pretty clear: the overhead of the signal system has to be negligible when there are no receivers.

I appreciate the data you have posted about the current performance. I will take all of your advice into consideration and will try to work on a new iteration aimed at

making unsubscribed signals as close to free as possible

[1] https://github.com/MagicStack/uvloop/tree/master/uvloop
[2] https://github.com/channelcat/sanic
[3] https://mail.python.org/pipermail/async-sig/2017-August/000382.html
[4] https://trio.readthedocs.io/en/latest/reference-core.html#instrument-api

@jek (Contributor) commented Aug 30, 2017

To be clear, send is well optimized for pure Python. A function call, a couple opcodes and a list allocation is the minimum cost without resorting to extremes like call stack inspection to determine if a return value is used or not.

Low level tight loop instrumentation via blinker signals is probably overkill. Instrumentation code can be very simple and fast. Instrumentation event registration is largely static so there's little need for connect/disconnect protection during signal iteration. No return values. No need for subscribing to specific senders at a framework level. Even trio's instrumentation carries a performance penalty through the 'interface' pattern: a listener on one event slows down all events.

@jek closed this Aug 30, 2017
@jek (Contributor) commented Aug 30, 2017

I personally would use something like the below (and do, for my own projects). If you're instrumenting such a tight loop that a difference of ~60ns is causing a 10% slowdown, I think you could look at inline events. (A single function call in CPython is ~100ns, so I figure this must be a very tight loop or an unrelated programming error.)

https://gist.github.com/jek/4a5713b5968524e5d072d36b9ad99a6b

fire_inline: 226.13 ns/call
fire_dry: 333.29 ns/call
fire_signal: 364.83 ns/call
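For readers who don't open the gist, the "inline events" idea might look roughly like the sketch below. fire_inline and fire_signal are the gist's names, but these bodies are my own guesses rather than the gist's code (fire_dry is omitted here; it presumably measures dispatch with no listeners attached):

```python
from blinker import Signal

# Registration is effectively static: a plain list of callables.
listeners = []
sig = Signal()

def fire_inline(payload):
    # Inline event: a bare truthiness check and direct calls in the hot path.
    if listeners:
        for listener in listeners:
            listener(payload)

def fire_signal(payload):
    # The same notification routed through a blinker signal.
    sig.send(None, payload=payload)
```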

@pfreixes (Author)

Regarding

Low level tight loop instrumentation via blinker signals is probably overkill

Absolutely. I was just trying to apply some ideas to other projects like blinker and django-signals that might be improved, because they solve the same problem and have the same issues. I never intended to use Blinker to instrument the loop.

Regarding the proposal to use "inline" signals: it seems pretty straightforward and might suit the case of the tight loop. Indeed, I gave it a try and modified the asyncio loop to simulate an inline signal [1]; the performance degradation is almost negligible. I think it might work, having just a signal/callback and gathering the proper statistics at each iteration.

Thanks for your time.

[1] pfreixes/cpython@83dbe66

@github-actions bot locked as resolved and limited conversation to collaborators Jun 30, 2023