
runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) #52652

Open
alphadose opened this issue May 2, 2022 · 44 comments
Labels: NeedsInvestigation, Performance
Milestone: Backlog

Comments

alphadose commented May 2, 2022

The current implementation of channels could use some improvement in terms of ns/op, B/op, and allocs/op.

I have made a proof-of-concept thread-safe queue that performs better on all three of these metrics: https://github.com/alphadose/ZenQ

Here are the benchmarks against native channels for a wide range of use cases: https://github.com/alphadose/ZenQ#benchmarks

My proposal is to make the underlying implementation of native channels more efficient by switching the algorithm and data structure to the lock-free one used in ZenQ, or to some other suitable implementation (debatable).

@alphadose alphadose changed the title Make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) proposal: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) May 2, 2022
@gopherbot gopherbot added this to the Proposal milestone May 2, 2022
@ianlancetaylor

This comment was marked as resolved.

@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels May 2, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog May 2, 2022
@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 2, 2022
@alphadose

This comment was marked as resolved.

@ianlancetaylor

This comment was marked as resolved.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 2, 2022
@ianlancetaylor

I took a very quick look. Channels aren't just thread-safe queues; they also have to handle the select statement and the close function. Can we retain the speed improvements while still implementing all required features?

CC @golang/runtime
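
For concreteness, here is a small, self-contained reminder (not from the thread) of two behaviors any replacement has to keep: a receive from a closed channel is always ready, and select must choose pseudo-randomly among ready cases.

package main

import "fmt"

func main() {
	ch := make(chan int, 1)
	done := make(chan struct{})
	close(done)

	// Both cases are ready: the buffered send can proceed, and receiving
	// from the closed channel never blocks. Over many runs, select must
	// pick between them pseudo-randomly.
	select {
	case ch <- 1:
		fmt.Println("sent")
	case <-done:
		fmt.Println("closed channel was ready")
	}

	// A receive from a closed channel yields the zero value and ok == false.
	v, ok := <-done
	fmt.Println(v, ok) // {} false
}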

@ianlancetaylor ianlancetaylor changed the title proposal: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) May 2, 2022
@alphadose

Yes, that's exactly what I'm aiming for:

the syntactic sugar of channels + much better performance.

Since channels are used so much, this could help a lot of people out.

@alphadose

PS: this is just a POC; as you said, it needs to be fully compatible with the methods in https://go.dev/src/runtime/chan.go

egonelbre commented May 3, 2022

Similarly, the implementation busy-spins when it waits, which is not suitable for the default implementation.

Also, there's a separate issue about lock-free channels: #8899

alphadose commented May 3, 2022

The spinning is applied per slot in the queue, whose size can range from 2 to 2^64 - 1.

Since there is no global lock, spinning is only used to contest individual slots, each of which carries its own state. With a large number of slots there are few collisions between the reader and writer indexes, and far less time is spent in the compare-and-swap loops because each lock/state has its own memory address. But you are right: when there are a lot of concurrent writer goroutines (1 million or more), it's better to use exponential backoff, or simply to sleep/park the goroutine, to reduce slot contention and optimize further.

As for #8899, my implementation has lower allocs/op than that lock-free buffer as well as native channels, and is also a lot faster.
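
To make the per-slot idea concrete, here is a minimal, self-contained sketch of contesting a single slot's state word with a CAS loop. This is illustrative only, not ZenQ's actual code; slotEmpty/slotBusy/slotFull are made-up names.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// Per-slot states; each slot has its own state word at its own address,
// so contention on one slot never touches another.
const (
	slotEmpty uint32 = iota // no data in the slot
	slotBusy                // a reader/writer is copying data
	slotFull                // data is ready to be read
)

type slot struct {
	state uint32
	data  int
}

// write contests only this slot's state with CAS; there is no global lock.
func (s *slot) write(v int) {
	for !atomic.CompareAndSwapUint32(&s.state, slotEmpty, slotBusy) {
		runtime.Gosched() // back off; could be exponential backoff or parking
	}
	s.data = v
	atomic.StoreUint32(&s.state, slotFull)
}

// read is symmetric: it waits for slotFull, then releases the slot.
func (s *slot) read() int {
	for !atomic.CompareAndSwapUint32(&s.state, slotFull, slotBusy) {
		runtime.Gosched()
	}
	v := s.data
	atomic.StoreUint32(&s.state, slotEmpty)
	return v
}

func main() {
	var s slot
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.write(42)
	}()
	fmt.Println(s.read()) // prints 42
	wg.Wait()
}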

egonelbre commented May 3, 2022

You need to implement select, select fairness, parking, closing, etc. before you can draw any conclusions about performance. Of course you can make a faster implementation when you discard some of the requirements.

Similarly, run the benchmarks that are in https://github.com/golang/go/blob/master/src/runtime/chan_test.go, and also show benchmarks such as 1M blocked writers. If you are not demonstrating where your code is worse, then it's not a fair evaluation.

@alphadose

Got it. I will write another benchmark replicating the 1M blocked writers scenario for both channels and ZenQ and note down the resource utilization.
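
For reference, one possible shape of such a benchmark; this is a hypothetical sketch, not the actual benchmark code from the ZenQ repo.

// main_test.go (hypothetical): run with `go test -bench=Million -benchtime=1x`
package main

import (
	"sync"
	"testing"
)

// A million writers blocked on a small buffered channel until a single
// reader drains it; a ZenQ variant would swap ch for a zenq queue.
func Benchmark_Chan_Million_Blocking_Writers(b *testing.B) {
	const writers = 1 << 20
	for n := 0; n < b.N; n++ {
		ch := make(chan int, 4096)
		var wg sync.WaitGroup
		wg.Add(writers)
		for i := 0; i < writers; i++ {
			go func() {
				defer wg.Done()
				ch <- 1 // most writers park here until the reader catches up
			}()
		}
		for i := 0; i < writers; i++ {
			<-ch
		}
		wg.Wait()
	}
}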

@alphadose

After that, we can look at how to implement a select{}-like mechanism across multiple ZenQs.

@egonelbre

For benchmarking, please see https://pkg.go.dev/golang.org/x/tools/cmd/benchcmp and http://www.inanzzz.com/index.php/post/yz8n/using-golang-bench-benchstat-and-benchcmp-to-measure-performance. You need statistical significance when measuring.

I also recommend reading through the lock-free channel issue in full, including the design documents and previous pull requests, before continuing. That way you'll avoid retreading topics that have come up before.

alphadose commented May 4, 2022

@egonelbre I was able to implement parking, saving a ton of goroutine overhead, and now ZenQ is more efficient than channels even in the case of 1 million blocking goroutines.

Here are the latest benchmarks, run over 30 cases:

name                                     time/op
_Chan_NumWriters1_InputSize600-8         24.6µs ± 1%
_ZenQ_NumWriters1_InputSize600-8         16.5µs ± 1%
_Chan_NumWriters3_InputSize60000-8       6.21ms ± 2%
_ZenQ_NumWriters3_InputSize60000-8       2.85ms ± 0%
_Chan_NumWriters8_InputSize6000000-8      735ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8      417ms ± 0%
_Chan_NumWriters100_InputSize6000000-8    1.61s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8    741ms ± 3%
_Chan_NumWriters1000_InputSize7000000-8   1.98s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8   1.05s ± 1%
_Chan_Million_Blocking_Writers-8          10.0s ±13%
_ZenQ_Million_Blocking_Writers-8          7.01s ±44%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8          0.00B
_ZenQ_NumWriters1_InputSize600-8          0.00B
_Chan_NumWriters3_InputSize60000-8         106B ±88%
_ZenQ_NumWriters3_InputSize60000-8       28.9B ±111%
_Chan_NumWriters8_InputSize6000000-8      946B ±267%
_ZenQ_NumWriters8_InputSize6000000-8      885B ±163%
_Chan_NumWriters100_InputSize6000000-8   46.7kB ±25%
_ZenQ_NumWriters100_InputSize6000000-8   16.2kB ±66%
_Chan_NumWriters1000_InputSize7000000-8   484kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  62.4kB ±82%
_Chan_Million_Blocking_Writers-8          553MB ± 0%
_ZenQ_Million_Blocking_Writers-8         95.9MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8           0.00
_ZenQ_NumWriters1_InputSize600-8           0.00
_Chan_NumWriters3_InputSize60000-8         0.00
_ZenQ_NumWriters3_InputSize60000-8         0.00
_Chan_NumWriters8_InputSize6000000-8      3.07 ±193%
_ZenQ_NumWriters8_InputSize6000000-8      2.07 ±142%
_Chan_NumWriters100_InputSize6000000-8      166 ±15%
_ZenQ_NumWriters100_InputSize6000000-8     53.5 ±50%
_Chan_NumWriters1000_InputSize7000000-8   1.74k ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8     525 ±39%
_Chan_Million_Blocking_Writers-8          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8          1.00M ± 0%

Further improvements:

More gains are possible by using runtime-internal calls like goparkunlock(), getg(), and goready() to schedule goroutines more effectively.

But this requires assembly stubs to load the *g pointer. Here is my current implementation, https://github.com/alphadose/ZenQ/blob/main/lib_runtime_linkage.go; please have a look and suggest any better methods.
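
For context, this is roughly what the go:linkname approach in such a file can look like. A sketch only: the signatures mirror the Go 1.18 runtime and are unstable across releases, the uint8 stands in for the runtime-internal waitReason type, and unsafe.Pointer smuggles the runtime's *g.

package zenq

import "unsafe"

// The package must also contain an empty .s file so the toolchain
// accepts the body-less declarations below.

// gopark parks the current goroutine; the runtime calls unlockf once the
// goroutine is committed to sleeping.
//
//go:linkname gopark runtime.gopark
func gopark(unlockf func(unsafe.Pointer, unsafe.Pointer) bool, lock unsafe.Pointer, reason uint8, traceEv byte, traceskip int)

// goready marks a parked goroutine (a runtime *g smuggled through an
// unsafe.Pointer) as runnable again.
//
//go:linkname goready runtime.goready
func goready(goroutine unsafe.Pointer, traceskip int)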

@alphadose

@egonelbre I was able to implement select{}, which selects fairly based on the least number of reads among the queues; implementation -> here.

Here is example code demonstrating its usage:

package main

import (
	"fmt"

	"github.com/alphadose/zenq"
)

type custom1 struct {
	alpha int
	beta  string
}

type custom2 struct {
	gamma int
}

var (
	zq1 = zenq.New[int]()
	zq2 = zenq.New[string]()
	zq3 = zenq.New[custom1]()
	zq4 = zenq.New[*custom2]()
)

func main() {
	go looper(intProducer)
	go looper(stringProducer)
	go looper(custom1Producer)
	go looper(custom2Producer)

	for i := 0; i < 40; i++ {

		// Selection occurs here
		switch data := zenq.Select(zq1, zq2, zq3, zq4).(type) {
		case int:
			fmt.Printf("Received int %d\n", data)
		case string:
			fmt.Printf("Received string %s\n", data)
		case custom1:
			fmt.Printf("Received custom data type number 1 %#v\n", data)
		case *custom2:
			fmt.Printf("Received pointer %#v\n", data)
		}
	}
}

func intProducer(ctr int) { zq1.Write(ctr) }

func stringProducer(ctr int) { zq2.Write(fmt.Sprint(ctr * 10)) }

func custom1Producer(ctr int) { zq3.Write(custom1{alpha: ctr, beta: fmt.Sprint(ctr)}) }

func custom2Producer(ctr int) { zq4.Write(&custom2{gamma: 1 << ctr}) }

func looper(producer func(ctr int)) {
	for i := 0; i < 10; i++ {
		producer(i)
	}
}

@egonelbre

@alphadose it's a polling implementation, which it should not be. The select can also lock up when there are multiple selects on the same queue, even when a regular select would be able to proceed.

Similarly, the fairness guarantee doesn't seem to hold:

func main() {
	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	go func() {
		for {
			zq1.Write(1)
		}
	}()
	go func() {
		for {
			zq2.Write(1)
		}
	}()

	for i := 0; i < 1e3; i++ {
		zq1.Read()
	}

	for i := 0; i < 100; i++ {
		data := zenq.Select(zq1, zq2)
		// Currently this only ever prints values from zq2; the select
		// should be fair between zq1 and zq2.
		fmt.Println(data)
	}
}

I'm all for implementing custom primitives for niche cases; however, I would recommend trying to improve the performance of the current implementation rather than reimplementing all the minutiae from scratch.

alphadose commented May 6, 2022

Agreed regarding the fairness.

As for performance, I am currently writing assembly to load the *g pointer from thread-local storage; I'll then benchmark the parking/unparking implementation. Hopefully we will see a good improvement.

@alphadose

@egonelbre I was able to get a good improvement in performance by using gopark() and goready(). Here is the comparison benchmark against the old algorithm, run over 30 cases with queue_size = 4096:

name                                     old time/op    new time/op    delta
_ZenQ_NumWriters1_InputSize600-8           16.5µs ± 1%    17.9µs ± 1%   +8.65%  (p=0.000 n=28+29)
_ZenQ_NumWriters3_InputSize60000-8         2.85ms ± 0%    2.67ms ± 6%   -6.11%  (p=0.000 n=23+30)
_ZenQ_NumWriters8_InputSize6000000-8        417ms ± 0%     313ms ± 5%  -24.83%  (p=0.000 n=23+29)
_ZenQ_NumWriters100_InputSize6000000-8      741ms ± 3%     516ms ± 2%  -30.40%  (p=0.000 n=29+30)
_ZenQ_NumWriters1000_InputSize7000000-8     1.05s ± 1%     0.45s ± 9%  -57.58%  (p=0.000 n=28+30)
_ZenQ_Million_Blocking_Writers-8            7.01s ±44%    10.98s ± 4%  +56.54%  (p=0.000 n=30+28)

name                                     old alloc/op   new alloc/op   delta
_ZenQ_NumWriters1_InputSize600-8            0.00B          0.00B          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8         28.9B ±111%    34.8B ±127%     ~     (p=0.268 n=30+29)
_ZenQ_NumWriters8_InputSize6000000-8        885B ±163%     671B ±222%     ~     (p=0.208 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8     16.2kB ±66%   13.3kB ±100%     ~     (p=0.072 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8    62.4kB ±82%    2.4kB ±210%  -96.20%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8           95.9MB ± 0%    95.5MB ± 0%   -0.41%  (p=0.000 n=28+30)

name                                     old allocs/op  new allocs/op  delta
_ZenQ_NumWriters1_InputSize600-8             0.00           0.00          ~     (all equal)
_ZenQ_NumWriters3_InputSize60000-8           0.00           0.00          ~     (all equal)
_ZenQ_NumWriters8_InputSize6000000-8        2.07 ±142%     1.40 ±186%     ~     (p=0.081 n=30+30)
_ZenQ_NumWriters100_InputSize6000000-8       53.5 ±50%     31.8 ±100%  -40.60%  (p=0.000 n=30+30)
_ZenQ_NumWriters1000_InputSize7000000-8       525 ±39%        6 ±227%  -98.95%  (p=0.000 n=30+30)
_ZenQ_Million_Blocking_Writers-8            1.00M ± 0%     0.99M ± 0%   -0.41%  (p=0.000 n=28+29)

Here is the current comparison against channels, also run over 30 cases with queue_size = 4096:

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          17.9µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.48ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       679ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       313ms ± 5%
_Chan_NumWriters100_InputSize6000000-8     1.58s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8     516ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.97s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    445ms ± 9%
_Chan_Million_Blocking_Writers-8           10.8s ± 1%
_ZenQ_Million_Blocking_Writers-8           11.0s ± 4%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8         95.0B ±65%
_ZenQ_NumWriters3_InputSize60000-8        34.8B ±127%
_Chan_NumWriters8_InputSize6000000-8       878B ±272%
_ZenQ_NumWriters8_InputSize6000000-8       671B ±222%
_Chan_NumWriters100_InputSize6000000-8    46.0kB ±31%
_ZenQ_NumWriters100_InputSize6000000-8   13.3kB ±100%
_Chan_NumWriters1000_InputSize7000000-8    488kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8  2.37kB ±210%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8       1.40 ±186%
_Chan_NumWriters100_InputSize6000000-8       164 ±20%
_ZenQ_NumWriters100_InputSize6000000-8     31.8 ±100%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-8    5.50 ±227%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%

Also note that with queue_size = 4096, the current ZenQ performs slower than channels in the million-blocking-writers case, but after increasing queue_size to 2^14 for both, ZenQ also beats channels in the million-goroutines case.

queue_size = 2^14:

Benchmark_Chan_Million_Blocking_Writers-8          	       1	10982172959 ns/op	551333344 B/op	 1991416 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8          	       1	9734108333 ns/op	92514952 B/op	  963291 allocs/op
PASS
ok  	command-line-arguments	41.376s

@alphadose

@egonelbre

Update:
With a few tweaks, ZenQ now performs faster than channels even in the million-goroutine case, for all queue_sizes.

Here are the latest benchmarks for queue_size = 4096:

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.2µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          18.1µs ± 1%
_Chan_NumWriters3_InputSize60000-8        5.52ms ± 3%
_ZenQ_NumWriters3_InputSize60000-8        2.67ms ± 6%
_Chan_NumWriters8_InputSize6000000-8       680ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8       308ms ± 4%
_Chan_NumWriters100_InputSize6000000-8     1.56s ± 6%
_ZenQ_NumWriters100_InputSize6000000-8     519ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8    441ms ±11%
_Chan_Million_Blocking_Writers-8           10.4s ± 3%
_ZenQ_Million_Blocking_Writers-8           8.56s ±24%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8          110B ±68%
_ZenQ_NumWriters3_InputSize60000-8        23.6B ±107%
_Chan_NumWriters8_InputSize6000000-8       585B ±234%
_ZenQ_NumWriters8_InputSize6000000-8       411B ±299%
_Chan_NumWriters100_InputSize6000000-8    44.7kB ±35%
_ZenQ_NumWriters100_InputSize6000000-8    19.7kB ±78%
_Chan_NumWriters1000_InputSize7000000-8    483kB ±10%
_ZenQ_NumWriters1000_InputSize7000000-8  1.13kB ±602%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          95.5MB ± 0%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       2.20 ±218%
_ZenQ_NumWriters8_InputSize6000000-8       0.90 ±344%
_Chan_NumWriters100_InputSize6000000-8       163 ±18%
_ZenQ_NumWriters100_InputSize6000000-8      47.0 ±79%
_Chan_NumWriters1000_InputSize7000000-8    1.79k ± 6%
_ZenQ_NumWriters1000_InputSize7000000-8    2.00 ±550%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8            995k ± 0%

egonelbre commented May 10, 2022

It does not make much sense to talk about performance until there's feature parity. As I mentioned, it's easy to make a faster queue if you don't need all the features. And there are still bugs in Select, which make the benchmark results not useful.

This is the bug I mentioned in the previous comment (this program sometimes fails to complete):

package main

import (
	"sync"

	"github.com/alphadose/zenq"
)

func main() {
	var wg sync.WaitGroup
	defer wg.Wait()

	zq1 := zenq.New[int32]()
	zq2 := zenq.New[uint32]()

	wg.Add(2)
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()
	go func() {
		defer wg.Done()
		for {
			v := zenq.Select(zq1, zq2)
			if v == int32(0) {
				break
			}
		}
	}()

	for i := 0; i < 1e4; i++ {
		zq1.Write(1)
	}
	println("shutting down")
	zq1.Write(0) // using 0 because closing has not been implemented
	zq1.Write(0)
}

Also, there's a mutex per element, which causes it to behave like a FIFO. This also introduces a significant memory overhead for things like make(chan struct{}, 1e9) and make(chan byte, 1e9).

EDIT: Removed a note about the fairness of two concurrent receives, as I'm not sure whether it's required.

@egonelbre

Also, your implementation has a hardcoded size plus a requirement that the number of elements be a power of 2.

@alphadose

I will work on the fairness of select by going through the channel tests.

Also, regarding the power-of-2 elements: this is just a POC; a const is required for declaring arrays instead of slices.
If this gets merged into the Go core, users could provide any size for the channel, and the underlying implementation could use fastlog2() to get the nearest power of 2 and build the channel on top of that.
For example, for make(chan int, 100) the underlying size would be 1 << fastlog2(100) = ~128.

The tradeoff of not using powers of 2 is that you have to use modulo instead of index masking to compute the read/write positions, which costs a few extra nanoseconds per operation.
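
A small self-contained sketch of that rounding plus the masking trick (roundUpPow2 is an illustrative name, not a runtime function):

package main

import "fmt"

// roundUpPow2 returns the smallest power of two >= n (for 1 <= n <= 2^63).
func roundUpPow2(n uint64) uint64 {
	n--
	n |= n >> 1
	n |= n >> 2
	n |= n >> 4
	n |= n >> 8
	n |= n >> 16
	n |= n >> 32
	return n + 1
}

func main() {
	size := roundUpPow2(100) // make(chan int, 100) -> 128 slots
	mask := size - 1
	for _, idx := range []uint64{127, 128, 300} {
		// With a power-of-two size, idx & mask is equivalent to the
		// slower idx % size for computing ring positions.
		fmt.Println(idx&mask, idx%size) // both columns match
	}
}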

@alphadose

Also, there isn't one mutex per element; it's just num(mutexes) = queue_size.

There is only one mutex per slot, so there isn't much memory overhead. This is evident from the million-goroutines case, where ZenQ used 400 MB less than channels.

@alphadose

@egonelbre I just benchmarked with type Payload byte; there isn't much memory overhead, same as before:

goos: darwin
goarch: arm64
Benchmark_Chan_NumWriters1_InputSize600-8          	   52896	     19625 ns/op	       0 B/op	       0 allocs/op
Benchmark_ZenQ_NumWriters1_InputSize600-8          	   93112	     12912 ns/op	       0 B/op	       0 allocs/op
Benchmark_Chan_NumWriters3_InputSize60000-8        	     262	   4607065 ns/op	      74 B/op	       0 allocs/op
Benchmark_ZenQ_NumWriters3_InputSize60000-8        	     510	   2291942 ns/op	      35 B/op	       0 allocs/op
Benchmark_Chan_NumWriters8_InputSize6000000-8      	       2	 511401104 ns/op	     512 B/op	       2 allocs/op
Benchmark_ZenQ_NumWriters8_InputSize6000000-8      	       3	 355385736 ns/op	     970 B/op	       2 allocs/op
Benchmark_Chan_NumWriters100_InputSize6000000-8    	       1	1266029250 ns/op	   31104 B/op	     134 allocs/op
Benchmark_ZenQ_NumWriters100_InputSize6000000-8    	       2	 551560271 ns/op	   28912 B/op	      69 allocs/op
Benchmark_Chan_NumWriters1000_InputSize7000000-8   	       1	1860939083 ns/op	  483072 B/op	    1775 allocs/op
Benchmark_ZenQ_NumWriters1000_InputSize7000000-8   	       3	 507001500 ns/op	       0 B/op	       0 allocs/op
Benchmark_Chan_Million_Blocking_Writers-8          	       1	10854043417 ns/op	552819712 B/op	 1997209 allocs/op
Benchmark_ZenQ_Million_Blocking_Writers-8          	       1	10369999125 ns/op	95540840 B/op	  994865 allocs/op
PASS
ok  	command-line-arguments	39.695s

Same with type Payload struct{}

Putting benchmarks aside, is there anything else besides selection fairness and closing channels that needs to be implemented to achieve feature parity?

@egonelbre

Also regarding the power of 2 elements...

I don't think restricting to pow2 is an acceptable tradeoff.

const is required for declaring arrays instead..

The slice should be implementable similarly to how it's done currently. I suspect it would still need to be a slice in the runtime; otherwise it would increase binary bloat.

There is only one mutex per slot, so there isn't much memory overhead. This is evident from the million-goroutines case, where ZenQ used 400 MB less than channels.

Size benchmark:

func BenchmarkChan_New(b *testing.B) {
	b.Run("struct{}", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan struct{}, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
	b.Run("byte", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan byte, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
	b.Run("int64", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			ch := make(chan int64, zenq.QueueSize)
			runtime.KeepAlive(ch)
		}
	})
}

func BenchmarkZenQ_New(b *testing.B) {
	b.Run("struct{}", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[struct{}]()
			runtime.KeepAlive(zq)
		}
	})
	b.Run("byte", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[byte]()
			runtime.KeepAlive(zq)
		}
	})
	b.Run("int64", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			zq := zenq.New[int64]()
			runtime.KeepAlive(zq)
		}
	})
}

BenchmarkChan_New/struct{}-32           29258943                42.65 ns/op           96 B/op          1 allocs/op
BenchmarkChan_New/byte-32                1000000              1110 ns/op            4864 B/op          1 allocs/op
BenchmarkChan_New/int64-32                144070              8599 ns/op           40960 B/op          1 allocs/op
BenchmarkZenQ_New/struct{}-32              48247             27403 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/byte-32                  46503             26348 ns/op          139264 B/op          1 allocs/op
BenchmarkZenQ_New/int64-32                 46122             26360 ns/op          139264 B/op          1 allocs/op

egonelbre commented May 10, 2022

Putting benchmarks aside, is there anything else besides selection fairness and closing channels that needs to be implemented to achieve feature parity?

I don't think this is the full list, and I'm not the person who can approve the change, but off the top of my head:

  1. Ensure it can pass all the tests in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  2. Run all the benchmarks in https://github.com/golang/go/blob/master/src/runtime/chan_test.go.
  3. No polling (including select).
  4. A SPIN model might be necessary if you decide on a complete rewrite (there are mentions of this in the other issue).
  5. Benchmark across multiple machines and many cores (e.g. 1, 2, 4, 8, 16, 32, 48).
  6. Note that https://go-review.googlesource.com/c/go/+/16740/ indicates some fairness requirements, i.e. a FIFO requirement (for non-select).
  7. Run benchmarks across the entire Go repository and the external test suite, demonstrating that common cases don't suffer (e.g. ctx.Done(), timers, and friends).
  8. Integrate it into runtime/chan.go -- there have been faster implementations before, so it's known that the perf of the current implementation can be improved.
  9. Play well with the race detector.

Or in other words, the problem isn't coming up with a slightly better queue, but integrating it into the Go codebase and ensuring that it doesn't break anything and that it preserves the necessary properties.

@alphadose

Regarding the race detector: I cannot currently use race.Acquire since this is not a runtime-internal package; that is only possible from within the Go codebase.

@egonelbre

Ah, select also needs to support both send and recv, as does the case of recv/send on nil channels.
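
For reference, the nil-channel semantics a replacement must preserve: send and receive on a nil channel block forever, so a nil case in a select is never chosen. A small illustration:

package main

import (
	"fmt"
	"time"
)

func main() {
	var ch chan int // nil channel

	select {
	case v := <-ch:
		// Never reached: receiving from a nil channel blocks forever,
		// so this case is never ready.
		fmt.Println(v)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("the nil channel case was never selected")
	}
}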

@egonelbre

Apparently I made a critical error in one of my last comments:

I'm the person who can approve the change....

It should have said "I'm not the person who can approve..."... I now need to find the largest facepalm emote.

alphadose commented May 23, 2022

@egonelbre

Updates:

  1. Closing implemented
  2. Selection implemented without polling (with scope for further improvement)
  3. Performance improved

Now I will run this implementation against the test suite you provided and try to pass all the tests, modifying the current implementation where required.

alphadose commented May 24, 2022

@egonelbre Updates:

By adding direct_send and an optimistic first-pass read approach to the selector (sketched after the benchmarks below), I was able to improve its performance.

It's now comparable to channels, although still not as fast for large batch sizes.
Here are some basic benchmarks for selection:

With Input Batch Size: 60 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 61.375µs
ZenQ Select Runner completed transfer in: 73.416µs
====================================================================

With Input Batch Size: 600 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 109.041µs
ZenQ Select Runner completed transfer in: 399.125µs
====================================================================

With Input Batch Size: 6000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 896.917µs
ZenQ Select Runner completed transfer in: 3.627416ms
====================================================================

With Input Batch Size: 600000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 129.959167ms
ZenQ Select Runner completed transfer in: 300.164333ms
====================================================================
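
A minimal sketch of the optimistic first pass described above (illustrative names, not ZenQ's API; the real slow path parks the goroutine via gopark instead of spinning on Gosched):

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// queue is a stand-in for a ZenQ-style queue with a non-blocking read;
// it holds a single int64, with 0 meaning "empty".
type queue struct{ v int64 }

func (q *queue) TryRead() (int64, bool) {
	v := atomic.SwapInt64(&q.v, 0)
	return v, v != 0
}

// selectFrom polls every queue once (the optimistic first pass) before
// falling back to the slow path.
func selectFrom(queues ...*queue) int64 {
	for {
		for _, q := range queues {
			if v, ok := q.TryRead(); ok {
				return v
			}
		}
		runtime.Gosched() // placeholder: the real slow path parks instead
	}
}

func main() {
	q1, q2 := &queue{}, &queue{}
	atomic.StoreInt64(&q2.v, 7)
	fmt.Println(selectFrom(q1, q2)) // 7, found on the first pass
}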

@egonelbre

@alphadose note that there's still a significant memory usage difference between ZenQ and channels: 5x for the usual case.

@alphadose

For a fair comparison of memory usage, it's better to split it into two categories: initialization memory and operational memory usage.

In terms of initialization memory, channels are better and have a lower footprint.

But in terms of operational memory usage, ZenQ is better:

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B     
_ZenQ_NumWriters1_InputSize600-8           0.00B     
_Chan_NumWriters3_InputSize60000-8          104B ±54%
_ZenQ_NumWriters3_InputSize60000-8         22.2B ±91%
_Chan_NumWriters8_InputSize6000000-8       813B ±307%
_ZenQ_NumWriters8_InputSize6000000-8       690B ±115%
_Chan_NumWriters100_InputSize6000000-8    42.6kB ±36%
_ZenQ_NumWriters100_InputSize6000000-8   5.92kB ±118%
_Chan_NumWriters1000_InputSize7000000-8    475kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-8   41.6kB ±34%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          47.4MB ± 9%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00     
_ZenQ_NumWriters1_InputSize600-8            0.00     
_Chan_NumWriters3_InputSize60000-8          0.00     
_ZenQ_NumWriters3_InputSize60000-8          0.00     
_Chan_NumWriters8_InputSize6000000-8       2.77 ±225%
_ZenQ_NumWriters8_InputSize6000000-8        2.63 ±52%
_Chan_NumWriters100_InputSize6000000-8       157 ±17%
_ZenQ_NumWriters100_InputSize6000000-8     14.3 ±116%
_Chan_NumWriters1000_InputSize7000000-8    1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8     30.3 ±42%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8           1.00M ± 0%

alphadose commented May 31, 2022

If you pay the initialization cost, you gain a substantial reduction in operational memory usage.

And the initialization cost is really cheap on today's hardware.

It's the operational cost that needs to be reduced and optimised.

@egonelbre

There are many ephemeral channels for context cancellations (and similar) that are impacted by the memory usage. Of course, it's difficult to understand the real-world impact without testing. In my experience, small channels are more common than large channels that require high throughput.

@alphadose

True, if throughput is low then it is better to use channels to save resources.

But there have also been cases where channels were insufficient, and in those scenarios users implemented their own queues.

IMO Go should also provide a throughput-optimised communication mechanism.

@egonelbre

If 95% (just guessing) of channel usage is not throughput-related, then throughput needs to be balanced carefully against those uses. The performance and memory usage of such use cases is probably more important for the stdlib than high-workload, high-throughput scenarios.

Similarly, as you've demonstrated, it's possible to have external library queues that trade memory for performance. So there's no pressing need to have it in the stdlib; however, I definitely agree it would be nice.

alphadose commented May 31, 2022

Actually, there is one more important reason for making it part of the standard library.

Currently, it is not possible to access the goroutine struct and similar internals from an external library.

With internal access, more optimisations become possible that would make the current ZenQ even more efficient, and that aren't possible from an external library.

Squeezing out the last bit of performance can only be done from inside the Go runtime, and once that is done, no external library will be able to match the performance of the native implementation (unless it uses a better algorithm).

alphadose commented Jun 9, 2022

@egonelbre performance has improved a lot in the million-goroutines case:

name                                     time/op
_Chan_NumWriters1_InputSize600-8         23.3µs ± 1%
_ZenQ_NumWriters1_InputSize600-8         38.7µs ± 1%
_Chan_NumWriters3_InputSize60000-8       5.48ms ± 1%
_ZenQ_NumWriters3_InputSize60000-8       2.63ms ± 1%
_Chan_NumWriters8_InputSize6000000-8      685ms ± 1%
_ZenQ_NumWriters8_InputSize6000000-8      254ms ± 3%
_Chan_NumWriters100_InputSize6000000-8    1.60s ± 1%
_ZenQ_NumWriters100_InputSize6000000-8    298ms ± 1%
_Chan_NumWriters1000_InputSize7000000-8   1.98s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-8   409ms ± 1%
_Chan_Million_Blocking_Writers-8          10.5s ± 1%
_ZenQ_Million_Blocking_Writers-8          1.99s ±16%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8          0.00B
_ZenQ_NumWriters1_InputSize600-8          0.00B
_Chan_NumWriters3_InputSize60000-8       17.5B ±163%
_ZenQ_NumWriters3_InputSize60000-8       13.4B ±348%
_Chan_NumWriters8_InputSize6000000-8      123B ±148%
_ZenQ_NumWriters8_InputSize6000000-8       545B ±56%
_Chan_NumWriters100_InputSize6000000-8   36.1kB ±49%
_ZenQ_NumWriters100_InputSize6000000-8   9.32kB ±32%
_Chan_NumWriters1000_InputSize7000000-8   479kB ± 8%
_ZenQ_NumWriters1000_InputSize7000000-8  89.3kB ± 5%
_Chan_Million_Blocking_Writers-8          553MB ± 0%
_ZenQ_Million_Blocking_Writers-8          122MB ± 3%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8           0.00
_ZenQ_NumWriters1_InputSize600-8           0.00
_Chan_NumWriters3_InputSize60000-8         0.00
_ZenQ_NumWriters3_InputSize60000-8         0.00
_Chan_NumWriters8_InputSize6000000-8      1.10 ±173%
_ZenQ_NumWriters8_InputSize6000000-8       3.19 ±57%
_Chan_NumWriters100_InputSize6000000-8      140 ±32%
_ZenQ_NumWriters100_InputSize6000000-8     21.8 ±33%
_Chan_NumWriters1000_InputSize7000000-8   1.77k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-8    46.5 ±27%
_Chan_Million_Blocking_Writers-8          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8          1.00M ± 0%

alphadose commented Jun 10, 2022

@egonelbre Updates:

Operational memory usage is drastically reduced (especially evident in the 1000-goroutines case):

goos: darwin
goarch: arm64

name                                     time/op
_Chan_NumWriters1_InputSize600-8          23.4µs ± 1%
_ZenQ_NumWriters1_InputSize600-8          17.7µs ± 0%
_Chan_NumWriters3_InputSize60000-8        5.54ms ± 5%
_ZenQ_NumWriters3_InputSize60000-8        2.63ms ± 2%
_Chan_NumWriters8_InputSize6000000-8       687ms ± 2%
_ZenQ_NumWriters8_InputSize6000000-8       243ms ± 4%
_Chan_NumWriters100_InputSize6000000-8     1.59s ± 4%
_ZenQ_NumWriters100_InputSize6000000-8     296ms ± 2%
_Chan_NumWriters1000_InputSize7000000-8    1.97s ± 0%
_ZenQ_NumWriters1000_InputSize7000000-8    409ms ± 2%
_Chan_Million_Blocking_Writers-8           10.4s ± 4%
_ZenQ_Million_Blocking_Writers-8           1.83s ±10%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-8           0.00B
_ZenQ_NumWriters1_InputSize600-8           0.00B
_Chan_NumWriters3_InputSize60000-8          117B ±63%
_ZenQ_NumWriters3_InputSize60000-8        22.1B ±122%
_Chan_NumWriters8_InputSize6000000-8     1.01kB ±196%
_ZenQ_NumWriters8_InputSize6000000-8      1.12kB ±89%
_Chan_NumWriters100_InputSize6000000-8    42.6kB ±37%
_ZenQ_NumWriters100_InputSize6000000-8    11.3kB ±28%
_Chan_NumWriters1000_InputSize7000000-8    481kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-8   90.5kB ± 6%
_Chan_Million_Blocking_Writers-8           553MB ± 0%
_ZenQ_Million_Blocking_Writers-8           123MB ± 4%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-8            0.00
_ZenQ_NumWriters1_InputSize600-8            0.00
_Chan_NumWriters3_InputSize60000-8          0.00
_ZenQ_NumWriters3_InputSize60000-8          0.00
_Chan_NumWriters8_InputSize6000000-8       3.43 ±162%
_ZenQ_NumWriters8_InputSize6000000-8        5.23 ±53%
_Chan_NumWriters100_InputSize6000000-8       158 ±20%
_ZenQ_NumWriters100_InputSize6000000-8      26.3 ±29%
_Chan_NumWriters1000_InputSize7000000-8    1.76k ± 2%
_ZenQ_NumWriters1000_InputSize7000000-8     48.3 ±28%
_Chan_Million_Blocking_Writers-8           2.00M ± 0%
_ZenQ_Million_Blocking_Writers-8           1.00M ± 0%

alphadose commented Jun 11, 2022

Here are a few more benchmarks for different environments/systems:

  1. windows, amd64, 16 cores
goos: windows
goarch: amd64
cpu: AMD Ryzen 7 5800H with Radeon Graphics

name                                      time/op
_Chan_NumWriters1_InputSize600-16         24.5µs ± 5%
_ZenQ_NumWriters1_InputSize600-16         17.7µs ± 2%
_Chan_NumWriters3_InputSize60000-16       4.75ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16       1.86ms ± 1%
_Chan_NumWriters8_InputSize6000000-16      800ms ± 5%
_ZenQ_NumWriters8_InputSize6000000-16      150ms ± 1%
_Chan_NumWriters100_InputSize6000000-16    1.66s ± 1%
_ZenQ_NumWriters100_InputSize6000000-16    160ms ± 1%
_Chan_NumWriters1000_InputSize7000000-16   1.95s ± 1%
_ZenQ_NumWriters1000_InputSize7000000-16   269ms ± 1%
_Chan_Million_Blocking_Writers-16          5.79s ± 2%
_ZenQ_Million_Blocking_Writers-16          1.37s ± 6%

name                                      alloc/op
_Chan_NumWriters1_InputSize600-16          0.00B
_ZenQ_NumWriters1_InputSize600-16          0.00B
_Chan_NumWriters3_InputSize60000-16         150B ±57%
_ZenQ_NumWriters3_InputSize60000-16       20.6B ±201%
_Chan_NumWriters8_InputSize6000000-16      472B ±283%
_ZenQ_NumWriters8_InputSize6000000-16     1.05kB ±58%
_Chan_NumWriters100_InputSize6000000-16   43.7kB ±38%
_ZenQ_NumWriters100_InputSize6000000-16   29.7kB ±17%
_Chan_NumWriters1000_InputSize7000000-16   484kB ± 7%
_ZenQ_NumWriters1000_InputSize7000000-16   120kB ±14%
_Chan_Million_Blocking_Writers-16          553MB ± 0%
_ZenQ_Million_Blocking_Writers-16          128MB ± 4%

name                                      allocs/op
_Chan_NumWriters1_InputSize600-16           0.00
_ZenQ_NumWriters1_InputSize600-16           0.00
_Chan_NumWriters3_InputSize60000-16         0.00
_ZenQ_NumWriters3_InputSize60000-16         0.00
_Chan_NumWriters8_InputSize6000000-16      2.00 ±150%
_ZenQ_NumWriters8_InputSize6000000-16       3.90 ±28%
_Chan_NumWriters100_InputSize6000000-16      148 ±34%
_ZenQ_NumWriters100_InputSize6000000-16     68.3 ±24%
_Chan_NumWriters1000_InputSize7000000-16   1.79k ± 5%
_ZenQ_NumWriters1000_InputSize7000000-16    62.3 ±36%
_Chan_Million_Blocking_Writers-16          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16          1.00M ± 0%

  2. ubuntu, amd64, 16 cores
goos: linux
goarch: amd64
cpu: AMD Ryzen 7 5800H with Radeon Graphics

name                                      time/op
_Chan_NumWriters1_InputSize600-16         23.4µs ± 4%
_ZenQ_NumWriters1_InputSize600-16         33.1µs ± 4%
_Chan_NumWriters3_InputSize60000-16       2.59ms ± 3%
_ZenQ_NumWriters3_InputSize60000-16       1.79ms ± 1%
_Chan_NumWriters8_InputSize6000000-16      334ms ± 6%
_ZenQ_NumWriters8_InputSize6000000-16      162ms ± 4%
_Chan_NumWriters100_InputSize6000000-16    515ms ± 6%
_ZenQ_NumWriters100_InputSize6000000-16    170ms ± 3%
_Chan_NumWriters1000_InputSize7000000-16   1.76s ± 3%
_ZenQ_NumWriters1000_InputSize7000000-16   273ms ± 2%
_Chan_Million_Blocking_Writers-16          4.52s ± 5%
_ZenQ_Million_Blocking_Writers-16          1.27s ±14%

name                                      alloc/op
_Chan_NumWriters1_InputSize600-16          0.00B
_ZenQ_NumWriters1_InputSize600-16          0.00B
_Chan_NumWriters3_InputSize60000-16        91.7B ±51%
_ZenQ_NumWriters3_InputSize60000-16        4.00B ± 0%
_Chan_NumWriters8_InputSize6000000-16      487B ±275%
_ZenQ_NumWriters8_InputSize6000000-16      879B ±111%
_Chan_NumWriters100_InputSize6000000-16   30.0kB ±47%
_ZenQ_NumWriters100_InputSize6000000-16   23.2kB ±54%
_Chan_NumWriters1000_InputSize7000000-16   463kB ±11%
_ZenQ_NumWriters1000_InputSize7000000-16   129kB ±10%
_Chan_Million_Blocking_Writers-16          553MB ± 0%
_ZenQ_Million_Blocking_Writers-16          124MB ± 3%

name                                      allocs/op
_Chan_NumWriters1_InputSize600-16           0.00
_ZenQ_NumWriters1_InputSize600-16           0.00
_Chan_NumWriters3_InputSize60000-16         0.00
_ZenQ_NumWriters3_InputSize60000-16         0.00
_Chan_NumWriters8_InputSize6000000-16      1.57 ±219%
_ZenQ_NumWriters8_InputSize6000000-16       3.48 ±44%
_Chan_NumWriters100_InputSize6000000-16     87.8 ±40%
_ZenQ_NumWriters100_InputSize6000000-16     54.3 ±54%
_Chan_NumWriters1000_InputSize7000000-16   1.67k ± 9%
_ZenQ_NumWriters1000_InputSize7000000-16    63.5 ±10%
_Chan_Million_Blocking_Writers-16          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-16          1.00M ± 0%

@NyaaaWhatsUpDoc

I would be interested to see these benchmark results on other architectures, and on machines with fewer resources to work with, as a Ryzen 5800H is quite beefy compared to a lot of the systems this could run on. Your implementation scales much better at the larger end of inputs/writers, but I'd be interested in how the initialization overhead scales down on more constrained systems.

alphadose commented Jun 16, 2022

Cool, I have been collecting benchmarks for all kinds of systems and will definitely benchmark this on low-end systems like the Raspberry Pi.

I am gathering all the benchmarks here

@alphadose

@NyaaaWhatsUpDoc apologies for the delay; here are the benchmarks for a low-end device (a 32-bit Raspberry Pi).

Device Info

OS: Raspbian GNU/Linux 11 (bullseye) armv7l
Host: Raspberry Pi 4 Model B Rev 1.5
Kernel: 5.15.32-v7l+
CPU: BCM2711 (4) @ 1.800GHz
Memory: 68MiB / 3838MiB

Benchstat of 20 cases

goos: linux
goarch: arm
name                                     time/op
_Chan_NumWriters1_InputSize600-4          230µs ± 4%
_ZenQ_NumWriters1_InputSize600-4          186µs ± 5%
_Chan_NumWriters3_InputSize60000-4       28.2ms ± 3%
_ZenQ_NumWriters3_InputSize60000-4       12.8ms ± 0%
_Chan_NumWriters8_InputSize6000000-4      4.14s ±10%
_ZenQ_NumWriters8_InputSize6000000-4      1.32s ± 1%
_Chan_NumWriters100_InputSize6000000-4    5.97s ± 5%
_ZenQ_NumWriters100_InputSize6000000-4    1.48s ± 5%
_Chan_NumWriters1000_InputSize7000000-4   7.23s ± 6%
_ZenQ_NumWriters1000_InputSize7000000-4   2.09s ± 4%
_Chan_Million_Blocking_Writers-4          20.3s ± 2%
_ZenQ_Million_Blocking_Writers-4          6.96s ± 4%

name                                     alloc/op
_Chan_NumWriters1_InputSize600-4          0.00B
_ZenQ_NumWriters1_InputSize600-4          0.00B
_Chan_NumWriters3_InputSize60000-4         227B ±27%
_ZenQ_NumWriters3_InputSize60000-4        77.9B ±91%
_Chan_NumWriters8_InputSize6000000-4      499B ±189%
_ZenQ_NumWriters8_InputSize6000000-4     1.49kB ± 4%
_Chan_NumWriters100_InputSize6000000-4   27.5kB ±19%
_ZenQ_NumWriters100_InputSize6000000-4   27.7kB ±42%
_Chan_NumWriters1000_InputSize7000000-4   290kB ± 5%
_ZenQ_NumWriters1000_InputSize7000000-4   135kB ± 8%
_Chan_Million_Blocking_Writers-4          325MB ± 0%
_ZenQ_Million_Blocking_Writers-4         76.2MB ± 3%

name                                     allocs/op
_Chan_NumWriters1_InputSize600-4           0.00
_ZenQ_NumWriters1_InputSize600-4           0.00
_Chan_NumWriters3_InputSize60000-4         1.00 ± 0%
_ZenQ_NumWriters3_InputSize60000-4         0.00
_Chan_NumWriters8_InputSize6000000-4      4.30 ±109%
_ZenQ_NumWriters8_InputSize6000000-4       19.2 ± 9%
_Chan_NumWriters100_InputSize6000000-4      171 ±13%
_ZenQ_NumWriters100_InputSize6000000-4      194 ±25%
_Chan_NumWriters1000_InputSize7000000-4   1.84k ± 3%
_ZenQ_NumWriters1000_InputSize7000000-4   1.09k ± 4%
_Chan_Million_Blocking_Writers-4          2.00M ± 0%
_ZenQ_Million_Blocking_Writers-4          1.00M ± 0%

We can see that ZenQ scales well on both low-end and high-end devices.

alphadose commented Aug 17, 2022

ZenQ's select{} performance has also improved since last time:

❯ go run benchmarks/selector/main.go
With Input Batch Size: 60 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 52.084µs
ZenQ Select Runner completed transfer in: 60.584µs
====================================================================

With Input Batch Size: 600 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 160.708µs
ZenQ Select Runner completed transfer in: 349.792µs
====================================================================

With Input Batch Size: 6000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 2.030959ms
ZenQ Select Runner completed transfer in: 3.499ms
====================================================================

With Input Batch Size: 600000 and Num Concurrent Writers: 4

Chan Select Runner completed transfer in: 174.827458ms
ZenQ Select Runner completed transfer in: 226.949375ms
====================================================================

For selection, ZenQ uses Go interfaces, which internally work via dynamic dispatch; this cost adds up and contributes to the slower performance compared to channels.
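
To illustrate the cost being described, here is a small hypothetical micro-benchmark (names are made up and unrelated to ZenQ's code) comparing a direct method call with the same call through an interface:

// dispatch_test.go (hypothetical): run with `go test -bench=.`
package main

import "testing"

type reader interface{ read() int }

type q struct{ v int }

func (s *q) read() int { return s.v }

func BenchmarkDirect(b *testing.B) {
	s := &q{v: 1}
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += s.read() // static call; the compiler can inline it
	}
	_ = sum
}

func BenchmarkInterface(b *testing.B) {
	var r reader = &q{v: 1}
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += r.read() // dynamic dispatch through the interface table
	}
	_ = sum
}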
