runtime: lock-free channels #8899

Open
dvyukov opened this Issue Oct 7, 2014 · 49 comments

@dvyukov (Member) commented Oct 7, 2014

Here is the design doc:
https://docs.google.com/document/d/1yIAYmbvL3JxOKOjuCyon7JhW4cSv1wy5hC0ApeGMV9s/pub

Here is the implementation:
https://golang.org/cl/12544043/

I've benchmarked the change on a real app where chans were abandoned in favor of other
sync primitives for performance reasons:
https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/cGHMdNoHGQEJ
The change provides a 23% end-to-end speedup, making chans significantly faster than
the other synchronization primitives.

@rsc added this to the Unplanned milestone Apr 10, 2015

@rsc removed the release-none label Apr 10, 2015

@olekukonko commented Oct 5, 2015

What is the update on this?

@odeke-em (Member) commented Mar 12, 2016

Unfortunately the CL seems to have gone stale, @dvyukov, due to changes to the runtime's code.

@OneOfOne (Contributor) commented Mar 24, 2016

I have implemented an extremely simple lock-free channel here; if there's interest, I'm willing to implement it in the runtime and submit a CL (and re-license it under the Go license, of course).

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s

#   ch := NewSize(runtime.NumCPU())
BenchmarkLFChan         30000000               168 ns/op              40 B/op          4 allocs/op
BenchmarkLFChan-4       30000000               175 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-8       20000000               205 ns/op              45 B/op          4 allocs/op
BenchmarkLFChan-32      20000000               201 ns/op              45 B/op          4 allocs/op

# ch := make(chan interface{}, runtime.NumCPU())
BenchmarkChan           50000000               115 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000               261 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         20000000               331 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               532 ns/op               8 B/op          1 allocs/op

PASS
ok      github.com/OneOfOne/lfchan      51.663s

@dvyukov (Member) commented Mar 24, 2016

@OneOfOne The tricky part will be to support blocking on empty/full chans and select statements.
Also, your chan does not seem to be FIFO, which is a requirement.
Also, you scan the whole array to find an item, which can be very expensive for large chans.
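
For reference, an array-based FIFO queue that avoids scanning looks roughly like a bounded MPMC ring buffer with per-cell sequence numbers, in the style of Vyukov's well-known bounded queue. This is only a minimal sketch with illustrative names (not the CL's code), and it deliberately leaves out blocking and select support, which is exactly the hard part called out above:

package mpmc

import "sync/atomic"

type cell struct {
	seq  uint64      // ticket that says whose turn this cell is
	data interface{} // element stored in the cell
}

// Queue is a bounded, FIFO, multi-producer/multi-consumer ring buffer.
type Queue struct {
	mask uint64
	buf  []cell
	enq  uint64 // next enqueue ticket
	deq  uint64 // next dequeue ticket
}

// New creates a queue; size must be a power of two.
func New(size uint64) *Queue {
	q := &Queue{mask: size - 1, buf: make([]cell, size)}
	for i := range q.buf {
		q.buf[i].seq = uint64(i)
	}
	return q
}

// TryEnqueue adds v; it returns false if the queue is full.
func (q *Queue) TryEnqueue(v interface{}) bool {
	for {
		pos := atomic.LoadUint64(&q.enq)
		c := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&c.seq)
		switch {
		case seq == pos: // cell is free for this ticket
			if atomic.CompareAndSwapUint64(&q.enq, pos, pos+1) {
				c.data = v
				atomic.StoreUint64(&c.seq, pos+1) // publish to consumers
				return true
			}
		case seq < pos: // a consumer has not freed this cell yet
			return false
		}
		// otherwise another producer raced ahead; reload and retry
	}
}

// TryDequeue removes the oldest element; it returns false if the queue is empty.
func (q *Queue) TryDequeue() (interface{}, bool) {
	for {
		pos := atomic.LoadUint64(&q.deq)
		c := &q.buf[pos&q.mask]
		seq := atomic.LoadUint64(&c.seq)
		switch {
		case seq == pos+1: // cell holds data for this ticket
			if atomic.CompareAndSwapUint64(&q.deq, pos, pos+1) {
				v := c.data
				c.data = nil
				atomic.StoreUint64(&c.seq, pos+q.mask+1) // free the cell for a later lap
				return v, true
			}
		case seq < pos+1: // a producer has not filled this cell yet
			return nil, false
		}
		// otherwise another consumer raced ahead; reload and retry
	}
}

TryEnqueue/TryDequeue simply fail when the ring is full or empty; turning those failures into parked goroutines and select support is the part that needs scheduler cooperation.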

@OneOfOne (Contributor) commented Mar 24, 2016

@dvyukov I'll work on fixing 2 and 3 and ping you back, and if I get your approval I'll try to figure out 1.

@ianlancetaylor (Contributor) commented Mar 24, 2016

Your benchmarks look good, but any change to channels has to work efficiently with select statements.

@OneOfOne (Contributor) commented Mar 26, 2016

@dvyukov I updated the package and fixed all your notes.

@ianlancetaylor I added Select support to the package.

➜ go test -bench=. -benchmem -cpu 1,4,8,32 -benchtime 3s
# ch := NewSize(100)
BenchmarkLFChan         20000000               292 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000               202 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       30000000               161 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      20000000               215 ns/op               8 B/op          1 allocs/op
# ch := make(chan interface{}, 100)
BenchmarkChan           10000000               371 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000               378 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32        10000000               513 ns/op               8 B/op          1 allocs/op
  • The reason the -32 run is slower than the -8 run is that I only have 8 cores; theoretically, it should get faster with the number of (real) cores you throw at it.

@jonhoo commented Mar 26, 2016

I'll benchmark up to 80 cores later today and post results.

@jonhoo commented Mar 26, 2016

On an 80-core Intel host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan         30000000               506 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       20000000              1107 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       10000000              1611 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       10000000              1710 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              2165 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              2192 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2288 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-64      10000000              2354 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-72       5000000              2454 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-80       5000000              2554 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               768 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1188 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         20000000              3215 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8          5000000              3657 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2734 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2387 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2448 ns/op               8 B/op          1 allocs/op
BenchmarkChan-64         5000000              2452 ns/op               8 B/op          1 allocs/op
BenchmarkChan-72         5000000              2552 ns/op               8 B/op          1 allocs/op
BenchmarkChan-80         5000000              2659 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      436.003s

On a 48-core AMD host:

$ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan         20000000               734 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-2       10000000              1271 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       20000000              1140 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       20000000              1097 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-16      10000000              1257 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-32      10000000              1564 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-48      10000000              2172 ns/op               8 B/op          1 allocs/op
BenchmarkChan           20000000               937 ns/op               8 B/op          1 allocs/op
BenchmarkChan-2         20000000              1147 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         10000000              1721 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         10000000              2372 ns/op               8 B/op          1 allocs/op
BenchmarkChan-16        10000000              2349 ns/op               8 B/op          1 allocs/op
BenchmarkChan-32         5000000              2295 ns/op               8 B/op          1 allocs/op
BenchmarkChan-48         5000000              2753 ns/op               8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan      276.712s

Also, a plot: [image]

@OneOfOne (Contributor) commented Mar 26, 2016

@jonhoo Thanks for running the benchmark. It's rather odd how different the numbers are; I wonder if it has something to do with having multiple CPU sockets.

@jonhoo commented Mar 26, 2016

@OneOfOne These machines are also somewhat older, so each core is slower than what you'd find in your laptop nowadays.

@RLH (Contributor) commented Mar 26, 2016

Any idea why adding HW threads doesn't result in increased parallelism?
Could you describe the HW? How many HW threads per core?


@jonhoo commented Mar 26, 2016

@RLH: see the linked pages for hardware descriptions. I'm not pinning threads, but I never run with -cpu higher than the number of real cores.

@dvyukov (Member) commented Mar 27, 2016

@OneOfOne There are several problems with this implementation:

  1. It is not FIFO.
  2. Recv/Send can fail spuriously.
  3. Len can return -1.
  4. It does not support blocking and unblocking of goroutines (you must not just spin; you need to tell the scheduler precisely when a goroutine can be descheduled and marked as blocked, and when it must be made runnable again).
  5. The same applies to select: you must not just poll in a loop; you need to tell the scheduler when to block and unblock the goroutine.

@OneOfOne (Contributor) commented Mar 27, 2016

@dvyukov

  1. Fixed here.
  2. I can't trigger that (it might have gotten fixed with 1).
  3. I'll see if I can fix it; I added a test, but I actually have to use -count to trigger it.
  4. That's not possible in user code AFAIK; I was going to do that once I submit a CL and can use runtime funcs.
  5. Same as 4.

@OneOfOne (Contributor) commented Mar 28, 2016

@minux I'm not sure what a SPIN model is.

@jonhoo commented Mar 28, 2016

@OneOfOne ping me if you want me to run another benchmark after making changes.

@OneOfOne (Contributor) commented Mar 28, 2016

@jonhoo Please do; I believe I've fixed all the issues that I could fix without using some runtime magic.


@jonhoo commented Mar 28, 2016

Seems to be doing worse for many cores now:

80core-intel $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan     30000000           624 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   10000000          1266 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   10000000          2400 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8    5000000          3423 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16   5000000          3536 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32   5000000          3305 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48   5000000          3674 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-64   3000000          4078 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-72   3000000          4473 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-80   3000000          4539 ns/op           8 B/op          1 allocs/op
BenchmarkChan       20000000           742 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000          1161 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4      5000000          2412 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     10000000          1791 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    10000000          1966 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    10000000          2386 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    10000000          2338 ns/op           8 B/op          1 allocs/op
BenchmarkChan-64    10000000          2281 ns/op           8 B/op          1 allocs/op
BenchmarkChan-72    10000000          2298 ns/op           8 B/op          1 allocs/op
BenchmarkChan-80    10000000          2512 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  526.199s
48core-amd $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan     20000000           863 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   10000000          1345 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   10000000          1257 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8   10000000          1373 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16  10000000          2270 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32   5000000          3315 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48   3000000          5026 ns/op           8 B/op          1 allocs/op
BenchmarkChan       20000000           969 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000          1364 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1956 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8      5000000          2792 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16     5000000          2718 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32     5000000          2659 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48     5000000          3101 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  358.916s

[plot image]

@OneOfOne (Contributor) commented Mar 28, 2016

That's really weird; on my system it is completely different :-/

Thank you nonetheless for taking the time to test it. It might be my "fake"
atomic value / spinlock.

➜ go test -bench=. -benchmem -run NONE -cpu 1,4,8 -benchtime 10s
BenchmarkLFChan         50000000               325 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       50000000               270 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       50000000               297 ns/op               8 B/op          1 allocs/op
BenchmarkChan           50000000               362 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         50000000               383 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         30000000               500 ns/op               8 B/op          1 allocs/op


@jonhoo commented Mar 28, 2016

Well, as I mentioned previously, the cores are probably slower than what's in your laptop. The trend is what's important, and it's clearly not great. Looking at CPU utilization during the 80-core benchmark, all cores are active. If it's at all helpful, here are some profiling results:

(pprof) top20
27010ms of 29760ms total (90.76%)
Dropped 100 nodes (cum <= 148.80ms)
Showing top 20 nodes out of 57 (cum >= 9970ms)
      flat  flat%   sum%        cum   cum%
   12060ms 40.52% 40.52%    12060ms 40.52%  runtime.futex
    3020ms 10.15% 50.67%     3020ms 10.15%  runtime/internal/atomic.Xchg
    1620ms  5.44% 56.12%     1620ms  5.44%  sync/atomic.AddUint32
    1540ms  5.17% 61.29%     5940ms 19.96%  github.com/OneOfOne/lfchan.Chan.Send
    1370ms  4.60% 65.89%     8060ms 27.08%  runtime.lock
    1130ms  3.80% 69.69%     1130ms  3.80%  runtime/internal/atomic.Cas
    1030ms  3.46% 73.15%     1030ms  3.46%  sync/atomic.CompareAndSwapUint32
    1020ms  3.43% 76.58%     1020ms  3.43%  runtime.globrunqget
     590ms  1.98% 78.56%      590ms  1.98%  runtime/internal/atomic.Xadd
     590ms  1.98% 80.54%      590ms  1.98%  sync/atomic.LoadUint32
     450ms  1.51% 82.06%      450ms  1.51%  runtime.procyield
     410ms  1.38% 83.43%     4400ms 14.78%  github.com/OneOfOne/lfchan.Chan.Recv
     330ms  1.11% 84.54%     7870ms 26.44%  runtime.schedule
     300ms  1.01% 85.55%      300ms  1.01%  runtime.memmove
     300ms  1.01% 86.56%      300ms  1.01%  runtime.osyield
     300ms  1.01% 87.57%      300ms  1.01%  sync/atomic.AddUint64
     290ms  0.97% 88.54%      450ms  1.51%  runtime.mallocgc
     260ms  0.87% 89.42%      670ms  2.25%  runtime.execute
     200ms  0.67% 90.09%      200ms  0.67%  runtime.gogo
     200ms  0.67% 90.76%     9970ms 33.50%  runtime.unlock

@jonhoo commented Mar 28, 2016

For comparison, here's the profile from BenchmarkChan:

(pprof) top20
22140ms of 25220ms total (87.79%)
Dropped 107 nodes (cum <= 126.10ms)
Showing top 20 nodes out of 69 (cum >= 330ms)
      flat  flat%   sum%        cum   cum%
   12490ms 49.52% 49.52%    12490ms 49.52%  runtime.futex
    1500ms  5.95% 55.47%     1500ms  5.95%  runtime/internal/atomic.Xchg
    1290ms  5.11% 60.59%     1290ms  5.11%  runtime.memmove
     900ms  3.57% 64.16%     7210ms 28.59%  runtime.lock
     610ms  2.42% 66.57%      610ms  2.42%  runtime/internal/atomic.Xadd
     600ms  2.38% 68.95%     5890ms 23.35%  runtime.chansend
     530ms  2.10% 71.05%      550ms  2.18%  testing.(*PB).Next
     490ms  1.94% 73.00%      490ms  1.94%  runtime/internal/atomic.Cas
     480ms  1.90% 74.90%      480ms  1.90%  runtime.procyield
     440ms  1.74% 76.65%      440ms  1.74%  runtime.osyield
     410ms  1.63% 78.27%      410ms  1.63%  runtime.gopark
     360ms  1.43% 79.70%     1790ms  7.10%  runtime.schedule
     340ms  1.35% 81.05%      340ms  1.35%  sync/atomic.AddUint64
     330ms  1.31% 82.36%      460ms  1.82%  runtime.execute
     310ms  1.23% 83.58%      720ms  2.85%  runtime.goparkunlock
     240ms  0.95% 84.54%      440ms  1.74%  runtime.mallocgc
     240ms  0.95% 85.49%      240ms  0.95%  runtime/internal/atomic.Load
     200ms  0.79% 86.28%      200ms  0.79%  runtime.(*waitq).dequeue
     200ms  0.79% 87.07%      200ms  0.79%  runtime/internal/atomic.Cas64
     180ms  0.71% 87.79%      330ms  1.31%  runtime.casgstatus

@davecheney (Contributor) commented Mar 28, 2016

Jon, can you post the SVG or PDF of that profile, please?


@jonhoo commented Mar 28, 2016

BenchmarkLFChan: [lfchan profile graph]
BenchmarkChan: [std profile graph]

@davecheney (Contributor) commented Mar 28, 2016

@OneOfOne why is your code calling runtime.GOMAXPROCS constantly? Your code should assume that value won't change.
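
A minimal sketch of the suggested fix, using a hypothetical Chan type that is not lfchan's real layout: read runtime.GOMAXPROCS once when the channel is created and reuse the cached value on the hot path (runtime.GOMAXPROCS(0) only queries the current setting):

package lfchansketch

import "runtime"

// Chan stands in for the channel type under discussion; only the cached
// GOMAXPROCS value matters for this sketch.
type Chan struct {
	cap   int // requested buffer size (queue state itself elided)
	procs int // GOMAXPROCS, read once at construction
}

func NewSize(size int) *Chan {
	return &Chan{
		cap:   size,
		procs: runtime.GOMAXPROCS(0), // query only; does not change the setting
	}
}

// spinBudget shows a hot-path use of the cached value: no runtime.GOMAXPROCS
// call per Send/Recv.
func (c *Chan) spinBudget() int {
	if c.procs == 1 {
		return 0 // spinning is pointless with a single P
	}
	return 128 * c.procs
}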

@kostya-sh (Contributor) commented Mar 28, 2016

I believe the benchmarks in https://github.com/OneOfOne/lfchan/blob/master/chan_test.go are not benchmarking what you want. Operations like wg.Add, wg.Done, and creating many goroutines probably limit scalability and skew the results.

I think a better approach would be to use one reader and one writer goroutine and let RunParallel handle the concurrency level for you. Also avoid using atomic operations and wait groups.
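
A rough sketch of that shape for the stdlib-channel side (names are illustrative; the lfchan side would swap in the package's Send/Recv calls): b.RunParallel drives the concurrency level chosen by -cpu, and the measured loop contains nothing but one send and one receive per iteration, with no WaitGroups or extra atomics:

package lfchan_test

import "testing"

func BenchmarkChan(b *testing.B) {
	ch := make(chan interface{}, 100) // shared buffered channel
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			ch <- 1 // one send ...
			<-ch    // ... and one receive per iteration
		}
	})
}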

Also I want to point out that the following result:

BenchmarkLFChan         50000000               325 ns/op
BenchmarkLFChan-4       50000000               270 ns/op

shows that the benchmarked operation became 270*4/325 ≈ 3.3 times slower in terms of CPU time: with 4 CPUs busy, each operation consumes roughly 270*4 = 1080 CPU-ns, versus 325 CPU-ns on a single CPU.

@davecheney (Contributor) commented Mar 28, 2016

For a lock-free implementation, this code sure does spend a long time in runtime.lock.

@OneOfOne (Contributor) commented Mar 28, 2016

@jonhoo I fixed that. Would you be so kind as to rerun the benchmarks? I simplified them as per @kostya-sh's suggestion and removed the GOMAXPROCS call.

@dvyukov pointed me in the right direction with his runtime.lock remark.

➜ go test -bench=. -benchmem  -run NONE -cpu 1,4,8 -benchtime 10s
BenchmarkLFChan         100000000              142 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-4       100000000              174 ns/op               8 B/op          1 allocs/op
BenchmarkLFChan-8       100000000              132 ns/op               8 B/op          1 allocs/op
BenchmarkChan           100000000              102 ns/op               8 B/op          1 allocs/op
BenchmarkChan-4         50000000               257 ns/op               8 B/op          1 allocs/op
BenchmarkChan-8         50000000               337 ns/op               8 B/op          1 allocs/op
PASS
ok      github.com/OneOfOne/lfchan      86.061

@bradfitz (Member) commented Mar 28, 2016

This bug was originally about lock-free channels in the runtime.

All this recent discussion is about an external package. Can we move the discussion elsewhere on GitHub (e.g. https://github.com/OneOfOne/lfchan/issues) so I can remain subscribed to this bug for its original purpose?

@jonhoo commented Mar 28, 2016

Created an issue over at OneOfOne/lfchan#3. Will continue posting benchmark results there.

@dvyukov (Member) commented Apr 3, 2016

@OneOfOne The queue is still not FIFO after 2f9d3a217eadbc557e758649fa153dd59ff14c11, and Len can still return -1. You can block and unblock goroutines without runtime support: sync.Mutex+Cond can block/unblock goroutines. You can stub the runtime scheduler by implementing two functions on top of Mutex+Cond: park(g *G) and unpark(g *G).
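
To make the park/unpark suggestion concrete, here is a minimal sketch of the two stubs built on sync.Mutex + sync.Cond; the G type here is just a placeholder handle, not the runtime's real g, and a channel algorithm prototyped against these stubs can later switch to the runtime's own gopark/goready:

package parkstub

import "sync"

// G is a placeholder goroutine handle: one condition variable plus a
// "ready" flag per blocked goroutine.
type G struct {
	mu    sync.Mutex
	cond  *sync.Cond
	ready bool
}

func NewG() *G {
	g := &G{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// park blocks the calling goroutine until unpark(g) is called.
func park(g *G) {
	g.mu.Lock()
	for !g.ready {
		g.cond.Wait()
	}
	g.ready = false // consume the wakeup so the next park blocks again
	g.mu.Unlock()
}

// unpark marks g runnable and wakes it if it is currently parked.
func unpark(g *G) {
	g.mu.Lock()
	g.ready = true
	g.cond.Signal()
	g.mu.Unlock()
}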

@gorakhargosh commented Apr 23, 2016

@jonhoo how did you generate the plot?

@gyf19 commented Dec 27, 2016

@dvyukov ping?

@dvyukov (Member) commented Dec 27, 2016

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

@olekukonko commented Jan 6, 2017

@dvyukov any plans to revisit this in the future?

@dvyukov (Member) commented Jan 6, 2017

No particular plans.

@stjepang commented Apr 6, 2017

@dvyukov

What are the fairness properties this implementation is missing?
Other than that, is there anything else wrong with it?

@dvyukov (Member) commented Apr 7, 2017

@stjepang I've lost all context already. What implementation do you mean? Why do you think there are fairness problems?

@stjepang commented Apr 7, 2017

@dvyukov In order to improve the performance of Go's channels, you designed almost lock-free channels and wrote this implementation. Now it's apparently abandoned, as you commented:

After the recent fairness changes in channels, my algorithm does not apply per se. My algorithm does not have the fairness properties that the current code provides.

I was wondering what the fairness changes are that make your algorithm inapplicable. Why exactly didn't your algorithm make it in as the official Go channel implementation? What are the problems with it that still need to be solved?

@dvyukov (Member) commented Apr 7, 2017

I was wondering what the fairness changes are that make your algorithm inapplicable.

I mean this change:
https://go-review.googlesource.com/c/16740
Though it does not mention fairness, nor does it add tests, so we can probably break the additional fairness guarantees that it provides.

Why exactly didn't your algorithm make it in as the official Go channel implementation? What are the problems with it that still need to be solved?

It was constantly deprioritized in favor of other runtime changes, so it constantly became outdated, and I wasn't able to keep it up to date. As far as I remember, it all worked. Now the major problem is rebasing it onto the current runtime code. We also need to decide what to do with respect to the fairness change 16740.

@olekukonko commented Apr 7, 2017

@dvyukov, I might be wrong, but I've noticed you have not been as active as before. Is this a result of your busy schedule, or have you lost the zeal after multiple rejections of some of your proposals, or are you working on a secret project we can't know about yet? :)

@dvyukov (Member) commented Apr 7, 2017

@olekukonko working on a non-secret project you can know about: https://github.com/google/syzkaller

@olekukonko commented Apr 7, 2017

@dvyukov, the project looks interesting. I will be experimenting with it. Thanks for the prompt response. Regards.
