
runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) #52652

Open
alphadose opened this issue May 2, 2022 · 28 comments
Labels: NeedsInvestigation
Milestone: Backlog

Comments

@alphadose commented May 2, 2022

The current implementation of channels could use some improvement in terms of ns/op, B/op, and allocs/op.

I have made a proof-of-concept (POC) thread-safe queue with better performance on all three of the above metrics: https://github.com/alphadose/ZenQ

Here are the benchmarks against native channels for a wide range of use cases: https://github.com/alphadose/ZenQ#benchmarks

My proposal is to make the underlying implementation of native channels more efficient by changing the algorithm and data structure to the lock-free one used in ZenQ, or to some other suitable implementation (debatable).

@alphadose alphadose changed the title Make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) proposal: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) May 2, 2022
@gopherbot gopherbot added this to the Proposal milestone May 2, 2022
@ianlancetaylor: This comment has been hidden.

@ianlancetaylor ianlancetaylor added NeedsInvestigation and removed Proposal labels May 2, 2022
@ianlancetaylor ianlancetaylor removed this from the Proposal milestone May 2, 2022
@ianlancetaylor ianlancetaylor added this to the Backlog milestone May 2, 2022
@ianlancetaylor ianlancetaylor added the WaitingForInfo label May 2, 2022
@alphadose: This comment has been hidden.

@ianlancetaylor: This comment has been hidden.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo label May 2, 2022
@ianlancetaylor (Contributor) commented May 2, 2022

I took a very quick look. Channels aren't just thread-safe queues; they also have to handle the select statement and the close function. Can we retain the speed improvements while still implementing all required features?

CC @golang/runtime

@ianlancetaylor ianlancetaylor changed the title proposal: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) May 2, 2022
@alphadose (Author) commented May 2, 2022

Yep, that is exactly what I want: the syntactic sugar of channels plus much better performance.

Since channels are used so much, this could help a lot of people.

@alphadose (Author) commented May 2, 2022

PS: this is just a POC; as you said, it needs to be fully compatible with the methods in https://go.dev/src/runtime/chan.go

@egonelbre (Contributor) commented May 3, 2022

Similarly, the implementation busy-spins when it waits, which is not suitable for the default implementation.

Also, there's a separate issue for having lock-free channels #8899

@alphadose (Author) commented May 3, 2022

The spinning is applied per slot of the queue, whose size can range from 2 to 2^64 - 1.

Because there is no global lock, spinning is only used to contend for individual slots, each of which has its own slot-specific state. With a large number of slots there are few collisions between reader and writer indexes, and far less time is spent in compare-and-swap loops because each lock/state has its own memory address. But you are right: when there are a lot of concurrent writer goroutines (1 million or more), it is better to use exponential backoff, or simply make the goroutine sleep/park, to reduce slot contention and optimize further.

As for #8899, my implementation has lower allocs/op than that lock-free buffer as well as native channels, and is also a lot faster.
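For illustration, here is a minimal sketch (not the actual ZenQ code; the names and the power-of-two sizing are assumptions) of the per-slot contention idea described above: every slot carries its own state word, so writers and readers spin/CAS on different memory addresses instead of contending on one queue-wide lock.

package slotqueue

import (
	"runtime"
	"sync/atomic"
)

const (
	slotEmpty uint32 = iota // no data, a writer may claim it
	slotBusy                // a writer is copying data in
	slotFull                // data is ready for a reader
)

type slot[T any] struct {
	state uint32
	item  T
}

// Queue is a bounded MPMC ring buffer; size must be a power of two.
type Queue[T any] struct {
	writeIdx uint64
	readIdx  uint64
	mask     uint64
	slots    []slot[T]
}

func New[T any](size uint64) *Queue[T] {
	return &Queue[T]{mask: size - 1, slots: make([]slot[T], size)}
}

func (q *Queue[T]) Write(v T) {
	// Claim a unique slot, then spin only on that slot's state word.
	s := &q.slots[(atomic.AddUint64(&q.writeIdx, 1)-1)&q.mask]
	for !atomic.CompareAndSwapUint32(&s.state, slotEmpty, slotBusy) {
		runtime.Gosched() // back off / park here under heavy contention
	}
	s.item = v
	atomic.StoreUint32(&s.state, slotFull)
}

func (q *Queue[T]) Read() T {
	s := &q.slots[(atomic.AddUint64(&q.readIdx, 1)-1)&q.mask]
	for atomic.LoadUint32(&s.state) != slotFull {
		runtime.Gosched()
	}
	v := s.item
	atomic.StoreUint32(&s.state, slotEmpty)
	return v
}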

@egonelbre (Contributor) commented May 3, 2022

You need to implement select, select fairness, parking, closing, etc. before you can draw any conclusions about performance. Of course, you can make a faster implementation when you discard some of the requirements.

Similarly, run the benchmarks in https://github.com/golang/go/blob/master/src/runtime/chan_test.go. You also need to show benchmarks such as 1M blocked writers. If you are not demonstrating where your code is worse, it's not a fair evaluation.
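For illustration, a rough sketch (a hypothetical benchmark, not taken from chan_test.go) of what a 1M-blocked-writers scenario for native channels could look like:

package chanbench_test

import (
	"sync"
	"testing"
)

// A million writers block on a small-capacity channel while a single reader drains it.
func BenchmarkMillionBlockingWriters(b *testing.B) {
	const writers = 1 << 20
	for n := 0; n < b.N; n++ {
		ch := make(chan int, 4096)
		var wg sync.WaitGroup
		wg.Add(writers)
		for i := 0; i < writers; i++ {
			go func() {
				defer wg.Done()
				ch <- 1 // most writers park here until the reader catches up
			}()
		}
		for i := 0; i < writers; i++ {
			<-ch
		}
		wg.Wait()
	}
}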

@alphadose (Author) commented May 3, 2022

Got it. I will write another benchmark to replicate the 1M blocked writers scenario for both channels and ZenQ, and note down the resource utilization.

@alphadose (Author) commented May 3, 2022

After that, we can look at how to implement a select{}-like mechanism over multiple ZenQs.

@egonelbre (Contributor) commented May 3, 2022

A note on benchmarking: please see https://pkg.go.dev/golang.org/x/tools/cmd/benchcmp and http://www.inanzzz.com/index.php/post/yz8n/using-golang-bench-benchstat-and-benchcmp-to-measure-performance. You need statistical significance when measuring.

I also recommend reading through the lock-free channel issue fully, along with the design documents and previous pull requests, before continuing. That way you'll avoid retreading topics that have come up before.

@alphadose (Author) commented May 4, 2022

@egonelbre I was able to implement parking, thereby saving a ton of goroutine overhead, and now ZenQ is more efficient than channels even with 1 million blocking goroutines.

Here are the latest benchmarks, run 30 times:

name                                     time/op
_Chan_NumWriters1_InputSize600-8         24.6µs ± 1%
_ZenQ_NumWriters1_InputSize600-8         16.5µs ± 1%
_Chan_NumWriters3_InputSize60000-8       6.21ms ± 2%
_ZenQ_NumWriters3_InputSize60000-8       2.85ms ± 0%
_Chan_NumWriters8_InputSize6000000-8      735ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8      417ms ± 0%
_Chan_NumWriters100_InputSize6000000-8    1.61s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8    741ms ± 3%
_Chan_NumWriters1000_InputSize7000000-8   1.98s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8   1.05s ± 1%
_Chan_Million_Blocking_Writers-8          10.0s ±13%
_ZenQ_Million_Blocking_Writers-8          7.01s ±44%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8          0.00B
_ZenQ_NumWriters1_InputSize600-8          0.00B
_Chan_NumWriters3_InputSize60000-8         106B ±88%
_ZenQ_NumWriters3_InputSize60000-8       28.9B ±111%
_Chan_NumWriters8_InputSize6000000-8      946B ±267%
_ZenQ_NumWriters8_InputSize6000000-8      885B ±163%
_Chan_NumWriters100_InputSize6000000-8   46.7kB ±25%
_ZenQ_NumWriters100_InputSize6000000-8   16.2kB ±66%
_Chan_NumWriters1000_InputSize7000000-8   484kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  62.4kB ±82%
_Chan_Million_Blocking_Writers-8          553MB ± 0%
_ZenQ_Million_Blocking_Writers-8         95.9MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8           0.00
_ZenQ_NumWriters1_InputSize600-8           0.00
_Chan_NumWriters3_InputSize60000-8         0.00
_ZenQ_NumWriters3_InputSize60000-8         0.00
_Chan_NumWriters8_InputSize6000000-8      3.07 ±193%
_ZenQ_NumWriters8_InputSize6000000-8      2.07 ±142%
_Chan_NumWriters100_InputSize6000000-8      166 ±15%
_ZenQ_NumWriters100_InputSize6000000-8     53.5 ±50%
_Chan_NumWriters1000_InputSize7000000-8   1.74k ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8     525 ±39%
_Chan_Million_Blocking_Writers-8          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8          1.00M ± 0%

Further improvements:

More can be gained by using runtime-internal calls like goparkunlock(), getg(), and goready() to schedule goroutines more effectively.

But this requires assembly stubs to load the *g pointer. Here is my current implementation, please have a look and suggest any better methods: https://github.com/alphadose/ZenQ/blob/main/lib_runtime_linkage.go
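For context, a rough sketch of the go:linkname technique referred to above (hedged: this is not the exact ZenQ code, unsafe.Pointer stands in for the runtime's internal *g type, and the pattern targets Go toolchains of that era, since newer releases restrict linknaming runtime internals). A package doing this also needs an (empty) .s file so the compiler accepts the body-less declarations, and getg() itself has to be an assembly stub because it is a compiler intrinsic.

package zenq

import "unsafe"

// goready marks a parked goroutine as runnable again.
//go:linkname goready runtime.goready
func goready(gp unsafe.Pointer, traceskip int)

// gopark parks the calling goroutine until goready is called on it.
// uint8/byte stand in for the runtime's unexported waitReason/trace types.
//go:linkname gopark runtime.gopark
func gopark(unlockf func(unsafe.Pointer, unsafe.Pointer) bool, lock unsafe.Pointer, reason uint8, traceEv byte, traceskip int)

// GetG returns the current goroutine's *g. It must be implemented as an
// assembly stub because getg is a compiler intrinsic and cannot be linknamed.
func GetG() unsafe.Pointer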

@alphadose (Author) commented May 6, 2022

@egonelbre I was able to implement select{}, which does fair selection based on the least number of reads among the various queues; implementation -> here

Here is an example code demonstrating its usage

package main

import (
	"fmt"

	"github.com/alphadose/zenq"
)

type custom1 struct {
	alpha int
	beta  string
}

type custom2 struct {
	gamma int
}

var (
	zq1 = zenq.New[int]()
	zq2 = zenq.New[string]()
	zq3 = zenq.New[custom1]()
	zq4 = zenq.New[*custom2]()
)

func main() {
	go looper(intProducer)
	go looper(stringProducer)
	go looper(custom1Producer)
	go looper(custom2Producer)

	for i := 0; i < 40; i++ {

		// Selection occurs here
		switch data := zenq.Select(zq1, zq2, zq3, zq4).(type) {
		case int:
			fmt.Printf("Received int %d\n", data)
		case string:
			fmt.Printf("Received string %s\n", data)
		case custom1:
			fmt.Printf("Received custom data type number 1 %#v\n", data)
		case *custom2:
			fmt.Printf("Received pointer %#v\n", data)
		}
	}
}

func intProducer(ctr int) { zq1.Write(ctr) }

func stringProducer(ctr int) { zq2.Write(fmt.Sprint(ctr * 10)) }

func custom1Producer(ctr int) { zq3.Write(custom1{alpha: ctr, beta: fmt.Sprint(ctr)}) }

func custom2Producer(ctr int) { zq4.Write(&custom2{gamma: 1 << ctr}) }

func looper(producer func(ctr int)) {
	for i := 0; i < 10; i++ {
		producer(i)
	}
}

@egonelbre (Contributor) commented May 6, 2022

@alphadose it's a polling implementation, which it should not be. The Select can also lock up when there are multiple selects on the same channel, even when a usual select would be able to proceed.

Similarly, the fairness guarantee doesn't seem to hold:

package main

import (
	"fmt"

	"github.com/alphadose/zenq"
)

func main() {
	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	go func() { for { zq1.Write(1) } }()
	go func() { for { zq2.Write(1) } }()

	// read zq1 many times before selecting from both
	for i := 0; i < 1e3; i++ { zq1.Read() }

	for i := 0; i < 100; i++ {
		data := zenq.Select(zq1, zq2)
		fmt.Println(data) // this only currently prints 2; however the select should be fair between zq1 and zq2
	}
}

I'm all for implementing custom primitives for niche cases; however, I would recommend trying to improve the performance of the current implementation rather than reimplementing all the minutiae from scratch.

@alphadose (Author) commented May 6, 2022

Agreed regarding the fairness.

As for performance, I am currently writing assembly to load the *g pointer from thread-local storage and will then benchmark the parking/unparking implementation; hopefully we will see a good improvement.

@alphadose (Author) commented May 9, 2022

@egonelbre I was able to get a good improvement in performance by using gopark() and goready(). Here is the comparison benchmark against the old algorithm, run 30 times with queue_size = 4096:

name                                     old time/op    new time/op    delta
_ZenQ_NumWriters1_InputSize600-8           16.5µs ± 1%    17.9µs ± 1%   +8.65%  (p=0.000 n=28+29)
_ZenQ_NumWriters3_InputSize60000-8         2.85ms ± 0%    2.67ms ± 6%   -6.11%  (p=0.000 n=23+30)
_ZenQ_NumWriters8_InputSize6000000-8        417ms ± 0%     313ms ± 5%  -24.83%  (p=0.000 n=23+29)
_ZenQ_NumWriters100_InputSize6000000-8      741ms ± 3%     516ms ± 2%  -30.40%  (p=0.000 n=29+30)
_ZenQ_NumWriters1000_InputSize7000000-8     1.05s ± 1%     0.45s ± 9%  -57.58%  (p=0.000 n=28+30)
_ZenQ_Million_Blocking_Writers-8            7.01s ±44%    10.98s ± 4%  +56.54%  (p=0.000 n=30+28)

name                                     old alloc/op   new alloc/op   delta
_ZenQ_NumWriters1_InputSize600-8            0.00B          0.00B          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8         28.9B ±111%    34.8B ±127%     ~     (p=0.268 n=30+29)
_ZenQ_NumWriters8_InputSize6000000-8        885B ±163%     671B ±222%     ~     (p=0.208 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8     16.2kB ±66%   13.3kB ±100%     ~     (p=0.072 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8    62.4kB ±82%    2.4kB ±210%  -96.20%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8           95.9MB ± 0%    95.5MB ± 0%   -0.41%  (p=0.000 n=28+30)

name                                     old allocs/op  new allocs/op  delta
_ZenQ_NumWriters1_InputSize600-8             0.00           0.00          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8           0.00           0.00          ~     (all equal)
_ZenQ_NumWriters8_InputSize6000000-8        2.07 ±142%     1.40 ±186%     ~     (p=0.081 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8       53.5 ±50%     31.8 ±100%  -40.60%  (p=0.000 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8       525 ±39%        6 ±227%  -98.95%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8            1.00M ± 0%     0.99M ± 0%   -0.41%  (p=0.000 n=28+29)

Here is the current comparison vs. channels, also run 30 times with queue_size = 4096:

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          17.9µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.48ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       679ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       313ms ± 5%
_Chan_NumWriters100_InputSize6000000-8     1.58s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8     516ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.97s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    445ms ± 9%
_Chan_Million_Blocking_Writers-8           10.8s ± 1%
_ZenQ_Million_Blocking_Writers-8           11.0s ± 4%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8         95.0B ±65%
_ZenQ_NumWriters3_InputSize60000-8        34.8B ±127%
_Chan_NumWriters8_InputSize6000000-8       878B ±272%
_ZenQ_NumWriters8_InputSize6000000-8       671B ±222%
_Chan_NumWriters100_InputSize6000000-8    46.0kB ±31%
_ZenQ_NumWriters100_InputSize6000000-8   13.3kB ±100%
_Chan_NumWriters1000_InputSize7000000-8    488kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8  2.37kB ±210%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8       1.40 ±186%
_Chan_NumWriters100_InputSize6000000-8       164 ±20%
_ZenQ_NumWriters100_InputSize6000000-8     31.8 ±100%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-8    5.50 ±227%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%

Also note that with queue_size = 4096, ZenQ currently performs slower than channels in the million-blocking-writers case, but after increasing queue_size to 2^14 for both, ZenQ performs better than channels in the million-goroutines case as well.

queue_size = 2^14

Benchmark_Chan_Million_Blocking_Writers-8          	       1	10982172959 ns/op	551333344 B/op	 1991416 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8          	       1	9734108333 ns/op	92514952 B/op	  963291 allocs/op
PASS
ok  	command-line-arguments	41.376s

@alphadose (Author) commented May 10, 2022

@egonelbre

Update: with a few tweaks, ZenQ now performs faster than channels even in the million-goroutines case, for all queue_sizes.

Here are the latest benchmarks for queue_size = 4096:

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.2µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          18.1µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.52ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       680ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       308ms ± 4%
_Chan_NumWriters100_InputSize6000000-8     1.56s ± 6%
_ZenQ_NumWriters100_InputSize6000000-8     519ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    441ms ±11%
_Chan_Million_Blocking_Writers-8           10.4s ± 3%
_ZenQ_Million_Blocking_Writers-8           8.56s ±24%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8          110B ±68%
_ZenQ_NumWriters3_InputSize60000-8        23.6B ±107%
_Chan_NumWriters8_InputSize6000000-8       585B ±234%
_ZenQ_NumWriters8_InputSize6000000-8       411B ±299%
_Chan_NumWriters100_InputSize6000000-8    44.7kB ±35%
_ZenQ_NumWriters100_InputSize6000000-8    19.7kB ±78%
_Chan_NumWriters1000_InputSize7000000-8    483kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  1.13kB ±602%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.20 ±218%
_ZenQ_NumWriters8_InputSize6000000-8       0.90 ±344%
_Chan_NumWriters100_InputSize6000000-8       163 ±18%
_ZenQ_NumWriters100_InputSize6000000-8      47.0 ±79%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 6%
_ZenQ_NumWriters1000_InputSize7000000-8    2.00 ±550%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%

@egonelbre (Contributor) commented May 10, 2022

It does not make much sense to talk about performance until there's feature parity. As I mentioned, it's easy to make a faster queue if you don't need all the features. Similarly, there are still bugs in the Select, which make the benchmark results not very useful.

This is the bug I mentioned in the previous comment (this program fails to complete sometimes):

package main

import (
	"sync"

	"github.com/alphadose/zenq"
)

func main() {
	var wg sync.WaitGroup
	defer wg.Wait()

	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	wg.Add(2)
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()

	for i := 0; i < 1e4; i++ {
		zq1.Write(1)
	}
	println("shutting down")
	zq1.Write(0) // using 0 because closing has not been implemented
	zq1.Write(0)
}

Also, there's a mutex per element, which causes it to behave like a FIFO. This also introduces a significant memory overhead for things like make(chan struct{}, 1e9) and make(chan byte, 1e9).

EDIT: Removed note about fairness of two concurrent receives, as I'm not sure whether it's required.

@egonelbre (Contributor) commented May 10, 2022

Also, your implementation has a hardcoded size and requires a power-of-2 number of elements.

@alphadose (Author) commented May 10, 2022

I will work on the fairness of select by going through the channel tests.

Also, regarding the power-of-2 elements: this is just a POC, and a const is required for declaring arrays instead of slices.
If this gets merged into the Go core, users could provide any size for the channel, and the underlying implementation could use fastlog2() to get the nearest power of 2 and build the channel on top of that.
For example, with make(chan int, 100) the underlying size would be 1 << fastlog2(100) ≈ 128.

The tradeoff of not using powers of 2 is that you have to use modulo instead of index masking to compute read/write positions, which costs a few nanoseconds per operation.
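To make that concrete, here is a small illustrative sketch (not from ZenQ; fastlog2 is approximated with math/bits) of rounding a requested capacity up to the next power of two so that indexes can be reduced with a mask instead of a modulo:

package main

import (
	"fmt"
	"math/bits"
)

// nextPow2 returns the smallest power of two >= n (for n >= 1).
func nextPow2(n uint64) uint64 {
	return 1 << bits.Len64(n-1)
}

func main() {
	requested := uint64(100)
	size := nextPow2(requested) // make(chan int, 100) -> underlying size 128
	mask := size - 1

	writeIdx := uint64(1000)
	fmt.Println(writeIdx & mask)      // index masking: only valid for power-of-two sizes
	fmt.Println(writeIdx % requested) // modulo: works for any size, but is a slower operation
}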

@alphadose (Author) commented May 10, 2022

Also, there isn't one mutex per element; it's just num(mutexes) = queue_size.

There is only one mutex per slot, so there isn't much memory overhead. This is evident from the million-goroutines case, where ZenQ used about 400 MB less than channels.

@alphadose (Author) commented May 10, 2022

@egonelbre I just benchmarked with type Payload byte; there isn't much memory overhead, same as before:

goos: darwin
goarch: arm64
Benchmark_Chan_NumWriters1_InputSize600-8          	   52896	     19625 ns/op	       0 B/op	       0 allocs/op
Benchmark_ZenQ_NumWriters1_InputSize600-8          	   93112	     12912 ns/op	       0 B/op	       0 allocs/op
Benchmark_Chan_NumWriters3_InputSize60000-8        	     262	   4607065 ns/op	      74 B/op	       0 allocs/op
Benchmark_ZenQ_NumWriters3_InputSize60000-8        	     510	   2291942 ns/op	      35 B/op	       0 allocs/op
Benchmark_Chan_NumWriters8_InputSize6000000-8      	       2	 511401104 ns/op	     512 B/op	       2 allocs/op
Benchmark_ZenQ_NumWriters8_InputSize6000000-8      	       3	 355385736 ns/op	     970 B/op	       2 allocs/op
Benchmark_Chan_NumWriters100_InputSize6000000-8    	       1	1266029250 ns/op	   31104 B/op	     134 allocs/op
Benchmark_ZenQ_NumWriters100_InputSize6000000-8    	       2	 551560271 ns/op	   28912 B/op	      69 allocs/op
Benchmark_Chan_NumWriters1000_InputSize7000000-8   	       1	1860939083 ns/op	  483072 B/op	    1775 allocs/op
Benchmark_ZenQ_NumWriters1000_InputSize7000000-8   	       3	 507001500 ns/op	       0 B/op	       0 allocs/op
Benchmark_Chan_Million_Blocking_Writers-8          	       1	10854043417 ns/op	552819712 B/op	 1997209 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8          	       1	10369999125 ns/op	95540840 B/op	  994865 allocs/op
PASS
ok  	command-line-arguments	39.695s

Same with type Payload struct{}.

Benchmarks aside, is there anything besides selection fairness and closing channels that still needs to be implemented to achieve feature parity?

@egonelbre (Contributor) commented May 10, 2022

Also regarding the power of 2 elements...

I don't think restricting to pow2 is an acceptable tradeoff.

const is required for declaring arrays instead..

The slice should be implementable similarly to how it's implemented currently. I suspect it would still need to be a slice in the runtime, otherwise it would increase binary bloat.

There is only one mutex per slot, so there isn't much memory overhead. This is evident from the million-goroutines case, where ZenQ used about 400 MB less than channels.

Size benchmark:

package zenq_test // package clause and imports added so this runs as a standalone test file

import (
	"runtime"
	"testing"

	"github.com/alphadose/zenq"
)

func BenchmarkChan_New(b *testing.B) {
	b.Run("struct{}", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan struct{}, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
	b.Run("byte", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan byte, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
	b.Run("int64", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan int64, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
}

func BenchmarkZenQ_New(b *testing.B) {
	b.Run("struct{}", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[struct{}]()
			runtime.KeepAlive(zq)
		}
	})
	b.Run("byte", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[byte]()
			runtime.KeepAlive(zq)
		}
	})
	b.Run("int64", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[int64]()
			runtime.KeepAlive(zq)
		}
	})
}

BenchmarkChan_New/struct{}-32           29258943                42.65 ns/op           96 B/op          1 allocs/op
BenchmarkChan_New/byte-32                1000000              1110 ns/op            4864 B/op          1 allocs/op
BenchmarkChan_New/int64-32                144070              8599 ns/op           40960 B/op          1 allocs/op
BenchmarkZenQ_New/struct{}-32              48247             27403 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/byte-32                  46503             26348 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/int64-32                 46122             26360 ns/op          139264 B/op          1 allocs/op

@egonelbre (Contributor) commented May 10, 2022

Putting benchmarks aside, anything else from selection fairness and closing channels to be implemented to achieve feature parity?

I don't think this is the full list, and I'm not the person who can approve the change, but off the top of my head:

  1. Ensure it can pass all the tests in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  2. Run all the benchmarks in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  3. No polling (including select).
  4. A SPIN model might be necessary if you decide on a complete rewrite (there are mentions of this in the other issue).
  5. Benchmark across multiple machines and many cores (e.g. 1, 2, 4, 8, 16, 32, 48).
  6. I noticed that https://go-review.googlesource.com/c/go/+/16740/ indicates some fairness requirements, i.e. a FIFO requirement (for non-select).
  7. Run benchmarks across the entire Go repository and the external test suite, demonstrating that common cases don't suffer (e.g. ctx.Done(), timers and friends).
  8. Integrate it into runtime/chan.go -- there have been faster implementations before, so it's known that the perf of the current implementation can be improved...
  9. Plays well with the race detector.

Or in other words, the problem isn't coming up with a slightly better queue, but integrating it into the Go codebase and ensuring it doesn't break anything and preserves the necessary properties.

@alphadose (Author) commented May 10, 2022

Regarding the race detector: I cannot currently use race.Acquire since this is not a runtime-internal package; that is only possible from within the Go codebase.

@egonelbre (Contributor) commented May 10, 2022

Ah, also: select needs to support both send and recv. Similarly, there is the case of recv/send on nil channels.
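For reference, a small standalone example of the nil-channel semantics that would also have to be reproduced: send and receive on a nil channel block forever, so a select case on a nil channel is never chosen.

package main

import (
	"fmt"
	"time"
)

func main() {
	var nilCh chan int // nil channel: send and receive block forever
	data := make(chan int, 1)
	data <- 42

	select {
	case v := <-nilCh: // never chosen
		fmt.Println("impossible", v)
	case v := <-data:
		fmt.Println("received", v)
	case <-time.After(time.Second):
		fmt.Println("timeout")
	}
}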

@egonelbre (Contributor) commented May 11, 2022

Apparently I made a critical error in one of my last comments:

I'm the person who can approve the change....

It should have said "I'm not the person who can approve..."... I now need to find the largest facepalm emote.
