
proposal: add container/queue #27935

christianrpetrin opened this issue Sep 29, 2018 · 74 comments


Proposal: Built in support for high performance unbounded queue

Author: Christian Petrin.

Last updated: November 26, 2018

Discussion at: #27935

Design document at https://github.com/golang/proposal/blob/master/design/27935-unbounded-queue-package.md

Abstract

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

Queues are a very old, well-established and well-known concept in computer science, yet Go doesn't provide a specialized, safe-to-use, performant and issue-free unbounded queue implementation.

Buffered channels provide an excellent option for use as a queue, but buffered channels are bounded and so don't scale to support very large data sets. The same applies to the standard ring package.

The standard list package can be used as the underlying data structure for building unbounded queues, but the performance yielded by this linked list based implementation is not optimal.

Implementing a queue using slices as suggested here is a feasible approach, but the performance yielded by this implementation can be abysmal in some high load scenarios.
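To make the pitfall concrete, here is a minimal sketch (not from the proposal; `NaiveQueue` and its method names are illustrative) of the kind of slice-based queue that suggestion describes. Pop re-slices the head off, so the popped element stays reachable through the backing array, and sustained Push calls keep copying the whole contents into ever larger arrays.

```go
package main

import "fmt"

// NaiveQueue is a minimal slice-backed FIFO of the kind critiqued here:
// Push appends, Pop re-slices the head off. The backing array is never
// shrunk, so popped elements stay reachable (a hidden memory leak), and
// append may copy the entire contents when it grows.
type NaiveQueue struct {
	items []interface{}
}

func (q *NaiveQueue) Push(v interface{}) {
	q.items = append(q.items, v) // may allocate and copy everything
}

func (q *NaiveQueue) Pop() (interface{}, bool) {
	if len(q.items) == 0 {
		return nil, false
	}
	v := q.items[0]
	q.items = q.items[1:] // head memory is NOT released to the GC
	return v, true
}

func (q *NaiveQueue) Len() int { return len(q.items) }

func main() {
	q := &NaiveQueue{}
	for i := 0; i < 3; i++ {
		q.Push(i)
	}
	v, ok := q.Pop()
	fmt.Println(v, ok, q.Len()) // 0 true 2
}
```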

Background

Queues that grow dynamically have many uses. As an example, I'm working on a logging system called CloudLogger that sends all logged data to external logging management systems, such as Stackdriver and Cloudwatch. External logging systems typically rate limit how much data their service will accept for a given account and time frame. So in a scenario where the hosting application is logging more data than the logging management system will accept at a given moment, CloudLogger has to queue the extra logs and send them to the logging management system at a pace the system will accept. As there's no telling how much data will have to be queued, since it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to use. Buffered channels are not ideal in this scenario as they have a limit on how much data they will accept; once that limit has been reached, the producers (goroutines adding to the channel) start to block, making the add operation "eventually" synchronous. A fully asynchronous operation is highly desirable in this scenario, as logging should not significantly slow down the hosting application.
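The blocking behavior described above is easy to demonstrate with a non-blocking send; `trySend` is an illustrative helper written for this sketch, not part of any proposed API:

```go
package main

import "fmt"

// trySend illustrates the bounded nature of buffered channels: once the
// buffer is full, a plain send (ch <- v) would block the producer, so a
// non-blocking send via select/default has to drop the value instead.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default: // buffer full: a plain send would block here
		return false
	}
}

func main() {
	ch := make(chan int, 2)     // bounded: capacity fixed at creation
	fmt.Println(trySend(ch, 1)) // true
	fmt.Println(trySend(ch, 2)) // true
	fmt.Println(trySend(ch, 3)) // false: the "queue" is full
}
```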

The problem above is one that, potentially, every system that calls another system faces. And in the cloud and microservices era, this is an extremely common scenario.

Due to the lack of built-in support for unbounded queues in Go, Go engineers are left to either:

  1. Research and use external packages, or
  2. Build their own queue implementation

Both approaches are riddled with pitfalls.

Using external packages, especially in enterprise-level software, requires a lot of care, as external, potentially untested and hard-to-understand code can have unwanted consequences. This problem is made much worse by the fact that, currently, there's no well-established and widely disseminated open source Go queue implementation, according to this Stack Overflow discussion, this GitHub search for Go queues, and Awesome Go.

Building a queue, on the other hand, might sound compelling, but building an efficient, high-performance, bug-free unbounded queue is a hard job that requires a solid computer science foundation, as well as a good deal of time to research different design approaches, test different implementations, and make sure the code is free of bugs and memory leaks.

In the end, what Go engineers have been doing up to this point is building their own queues, which are for the most part inefficient and can have disastrous, yet hidden, performance and memory issues. As examples of poorly designed and/or implemented queues, the approaches suggested here and here (among many others) require a linear copy of the internal slice for resizing purposes. Some implementations also have memory issues, such as an ever-expanding internal slice, and memory leaks.

Proposal

I propose to add a new package, "container/queue", to the standard library to support in-memory unbounded queues. The proposed queue implementation offers excellent performance and very low memory consumption when compared to three promising open source implementations (gammazero, phf and juju); to Go channels used as a queue; to the standard list package used as a queue; and to six other experimental queue implementations.

The proposed queue implementation offers the most balanced approach to performance under different loads, being significantly faster while still using less memory than every other queue implementation in the tests.

The closest data structure Go has to offer for building dynamically growing queues for large data sets is the standard list package. When comparing the proposed solution to the list package used as an unbounded queue (refer to "BenchmarkList"), the proposed solution is consistently faster while also displaying a much lower memory footprint.

Reasoning

There are two well-accepted approaches to implementing queues when it comes to the underlying data structure:

  1. Using a linked list
  2. Using an array

A linked list as the underlying data structure for an unbounded queue has the advantage of scaling efficiently when the data structure needs to grow to accommodate more values, since existing elements don't need to be repositioned or copied when the queue grows.

However, there's a few concerns with this approach:

  1. The prev/next pointers stored for each value require a good deal of extra memory
  2. Since each "node" in the linked list can be allocated far away from the previous one, navigating the list can be slow due to its poor memory locality
  3. Adding new values always requires new memory allocations and pointer updates, hindering performance
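A minimal singly linked-list queue sketch (illustrative names, not one of the proposal's experimental implementations) makes concerns 1 and 3 visible: every value carries a next pointer, and every Push costs a separate heap allocation.

```go
package main

import "fmt"

// listNode is one element of a singly linked queue. Note the per-value
// overhead: every stored value drags along a next pointer, and every
// Push allocates a fresh node on the heap.
type listNode struct {
	v    interface{}
	next *listNode
}

// ListQueue is a minimal linked-list FIFO sketch.
type ListQueue struct {
	head, tail *listNode
	length     int
}

func (q *ListQueue) Push(v interface{}) {
	n := &listNode{v: v} // one heap allocation per value
	if q.tail == nil {
		q.head, q.tail = n, n
	} else {
		q.tail.next = n
		q.tail = n
	}
	q.length++
}

func (q *ListQueue) Pop() (interface{}, bool) {
	if q.head == nil {
		return nil, false
	}
	v := q.head.v
	q.head = q.head.next
	if q.head == nil {
		q.tail = nil
	}
	q.length--
	return v, true
}

func main() {
	q := &ListQueue{}
	q.Push("a")
	q.Push("b")
	v, _ := q.Pop()
	fmt.Println(v, q.length) // a 1
}
```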

On the other hand, using a slice as the underlying data structure for unbounded queues has the advantage of very good memory locality, making retrieval of values faster compared to linked lists. Also, an "allocate more than needed right now" approach can easily be implemented with slices.

However, when the slice needs to expand to accommodate new values, a widely adopted strategy is to allocate a new, larger slice, copy over all elements from the previous slice into the new one, and use the new one to add the new elements.

The problem with this approach is the obvious need to copy all the values from the older, smaller slice into the new one, yielding poor performance when the number of values that need copying is fairly large.

Another potential problem is a theoretical lower limit on how much data such a queue can hold: slices, like arrays, have to allocate their positions in sequential memory addresses, so the maximum number of items the queue could ever hold is the maximum size a slice can be allocated on that particular system at any given moment. Due to modern memory management techniques such as virtual memory and paging, this is a very hard scenario to corroborate through practical testing.

Nonetheless, this approach doesn't scale well with large data sets.

Having said that, there's a third, newer approach to implementing unbounded queues: use fixed size linked slices as the underlying data structure.

The fixed-size linked slices approach is a hybrid of the first two, providing the good memory locality of arrays alongside the efficient growing mechanism of linked lists. It is also not limited by the maximum size a slice can be allocated, and so can hold and deal efficiently with a theoretically much larger amount of data than pure slice-based implementations.
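The hybrid layout can be sketched as follows. `chunk`, `grow` and the chunk size of 4 are illustrative (the proposal settles on 128); the point is that growth links a new fixed-size slice instead of copying existing values.

```go
package main

import "fmt"

// chunkSize is kept tiny for illustration; the proposal uses 128.
const chunkSize = 4

// chunk sketches the linked-slices layout: values live in fixed-size
// slices (good memory locality), and chunks are linked together
// (growth without copying existing elements).
type chunk struct {
	vals []interface{}
	next *chunk
}

// grow appends v, linking a fresh chunk when the current tail is full.
// No existing value is ever moved or copied.
func grow(tail *chunk, v interface{}) *chunk {
	if len(tail.vals) == cap(tail.vals) {
		n := &chunk{vals: make([]interface{}, 0, chunkSize)}
		tail.next = n
		tail = n
	}
	tail.vals = append(tail.vals, v)
	return tail
}

func main() {
	head := &chunk{vals: make([]interface{}, 0, chunkSize)}
	tail := head
	for i := 0; i < 10; i++ {
		tail = grow(tail, i)
	}
	count := 0
	for c := head; c != nil; c = c.next {
		count++
	}
	fmt.Println(count) // 10 values across three linked 4-item chunks: 3
}
```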

Rationale

Research

A first implementation of the new design was built.

The benchmark tests showed the new design was very promising, so I decided to research other possible queue designs and implementations with the goal of improving the first design and implementation.

As part of the research to identify the best possible queue designs and implementations, I implemented and probed a total of 7 experimental queue implementations. Below are a few of the most interesting ones.

  • queueimpl1: custom queue implementation that stores the values in a simple slice. Pop removes the first slice element. This is a slice based implementation that tests this suggestion.
  • queueimpl5: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when storing the "next" pointer as part of the values slice instead of having it as a separate "next" field. The next element is stored in the last position of the internal slices, which is a reserved position.
  • queueimpl7: custom queue implementation that stores the values in linked slices. This implementation tests the queue performance when performing lazy creation of the internal slice as well as starting with a 1-sized slice, allowing it to grow up to 16 by using the built in append function. Subsequent slices are created with 128 fixed size.

As part of the research, I also investigated and probed the open source queue implementations below.

  • phf: this is a slice, ring based queue implementation. Interesting to note the author did a pretty good job researching and probing other queue implementations as well.
  • gammazero: the deque implemented in this package is also a slice, ring based queue implementation.
  • juju: the deque implemented in this package uses a linked list based approach, similar to experimental implementations such as queueimpl3. The biggest difference between this implementation and the experimental ones is that this queue uses the standard list package as its linked list. The standard list package implements a doubly linked list, while the experimental implementations implement their own singly linked lists.

The standard list package as well as buffered channels were probed as well.

Benchmark Results

Benchmark charts (not reproduced here) compare performance (ns/op) and memory (B/op) for three scenarios: add and remove 100 items, add and remove 100k items, and aggregated results.
Detailed, curated results can be found here

Aggregated, curated results can be found here

Given the above results, queueimpl7, henceforth just "impl7", proved to be the most balanced implementation, being either faster than or very competitive with the others in every test scenario, from both a performance and a memory perspective.

Refer here for more details about the tests.

The benchmark tests can be found here.

Impl7 Design and Implementation

Impl7 was the result of the observation that some slice-based implementations such as queueimpl1 and queueimpl2 offer phenomenal performance when the queue is used with small data sets.

For instance, comparing queueimpl3 (a very simple linked-slice implementation) with queueimpl1 (a very simple slice-based implementation), the results for adding 0 (init time only), 1 and 10 items are very favorable to impl1, from both a performance and a memory perspective.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl3.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%  472.53ns ± 7%   +6821.99%  (p=0.000 n=20+17)
/1-4         48.1ns ± 6%   492.4ns ± 5%    +924.66%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     695ns ± 8%     +30.57%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    2.50µs ± 4%     -21.69%  (p=0.000 n=18+19)
/1000-4      24.5µs ± 3%    23.6µs ± 2%      -3.33%  (p=0.000 n=19+19)
/10000-4      322µs ± 4%     238µs ± 1%     -26.02%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.3ms ±13%     -79.32%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B       2080.00B ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%   2080.0B ± 0%  +12900.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%     2152B ± 0%    +278.87%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    2.87kB ± 0%     -34.13%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    24.6kB ± 0%     -39.54%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     244kB ± 0%     -67.27%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%     -75.85%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           2.00 ± 0%       +Inf%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%    +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      11.0 ± 0%     -21.43%  (p=0.000 n=20+20)
/100-4          108 ± 0%       101 ± 0%      -6.48%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.01k ± 0%      +0.50%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%      +1.35%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%      +1.53%  (p=0.000 n=20+20)

Impl7 is a hybrid that uses a simple slice-based queue for small data sets and the fixed-size linked slices approach for large data sets, an approach that scales well and offers good performance for small and large data sets alike.

The implementation starts by lazily creating the first slice to hold the first values added to the queue.

const (
    // firstSliceSize holds the size of the first slice.
    firstSliceSize = 1

    // maxFirstSliceSize holds the maximum size of the first slice.
    maxFirstSliceSize = 16

    // maxInternalSliceSize holds the maximum size of each internal slice.
    maxInternalSliceSize = 128
)

...

// Push adds a value to the queue.
// The complexity is amortized O(1).
func (q *Queueimpl7) Push(v interface{}) {
    if q.head == nil {
        h := newNode(firstSliceSize) // Returns a 1-sized slice.
        q.head = h
        q.tail = h
        q.lastSliceSize = maxFirstSliceSize
    } else if len(q.tail.v) >= q.lastSliceSize {
        n := newNode(maxInternalSliceSize) // Returns a 128-sized slice.
        q.tail.n = n
        q.tail = n
        q.lastSliceSize = maxInternalSliceSize
    }

    q.tail.v = append(q.tail.v, v)
    q.len++
}

...

// newNode returns an initialized node.
func newNode(capacity int) *Node {
    return &Node{
        v: make([]interface{}, 0, capacity),
    }
}

The very first slice is created with capacity 1. The implementation lets the built-in append function dynamically resize the slice up to 16 (maxFirstSliceSize) positions. After that, it reverts to creating fixed-size, 128-position slices, which offer the best performance for data sets above 16 items.

16 items was chosen because this seems to provide the most balanced performance for small and large data sets, according to the array size benchmark tests. Above 16 items, growing the slice means allocating a new, larger one and copying all 16 elements from the previous slice into the new one. The append function's phenomenal performance can only compensate for the added copying of elements if the data set is very small: no more than 8 items in the benchmark tests. Above 8 items, the fixed-size slice approach is consistently faster and uses less memory; from then on, 128-sized slices are allocated and linked together whenever the data structure needs to grow to accommodate new values.

Why 16? Why not 15 or 14?

The builtin append function, as of "go1.11 darwin/amd64", seems to double the slice size every time it needs to allocate a new one.

ts := make([]int, 0, 1)

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 1 item; output: 1

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 2 items; output: 2

ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 3 items; output: 4

ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 5 items; output: 8

ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
ts = append(ts, 1)
fmt.Println(cap(ts)) // Slice has 9 items; output: 16

Since the append function will resize the slice from 8 to 16 positions, it makes sense to use all 16 already allocated positions before switching to the fixed size slices approach.
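The proposal only shows Push and newNode; below is a self-contained sketch of how a matching Pop could work. The head-position index (`hp`), the tiny chunk capacity of 4 and all field names are assumptions made for illustration, not impl7's actual source.

```go
package main

import "fmt"

// Node and Queue mirror the field names in the Push snippet above;
// hp (read position within head.v) is an assumed field.
type Node struct {
	v []interface{}
	n *Node
}

type Queue struct {
	head, tail *Node
	hp         int // next position to read from head.v
	length     int
}

func (q *Queue) Push(v interface{}) {
	if q.head == nil {
		n := &Node{v: make([]interface{}, 0, 4)} // tiny chunks for the sketch
		q.head, q.tail = n, n
	} else if len(q.tail.v) == cap(q.tail.v) {
		n := &Node{v: make([]interface{}, 0, 4)}
		q.tail.n = n
		q.tail = n
	}
	q.tail.v = append(q.tail.v, v)
	q.length++
}

// Pop removes and returns the front value; ok is false when empty.
func (q *Queue) Pop() (interface{}, bool) {
	if q.length == 0 {
		return nil, false
	}
	v := q.head.v[q.hp]
	q.head.v[q.hp] = nil // drop the reference so the GC can reclaim it
	q.hp++
	q.length--
	if q.hp == len(q.head.v) && q.head.n != nil {
		q.head = q.head.n // whole head slice consumed: release it
		q.hp = 0
	}
	return v, true
}

func main() {
	q := &Queue{}
	for i := 1; i <= 6; i++ {
		q.Push(i)
	}
	a, _ := q.Pop()
	b, _ := q.Pop()
	fmt.Println(a, b, q.length) // 1 2 4
}
```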

Design Considerations

Impl7 uses linked slices as its underlying data structure.

The reason for the choice comes from two main observations of slice based queues:

  1. When the queue needs to expand to accommodate new values, a new, larger slice needs to be allocated and used
  2. Allocating and managing large slices is expensive, especially in an overloaded system with little available physical memory

To help clarify the scenario, below is what happens when a slice-based queue that already holds, say, 1 billion items needs to expand to accommodate a new item.

Slice based implementation

  • Allocate a new slice twice the size of the previous one, i.e. a 2 billion position slice
  • Copy over all 1 billion items from the previous slice into the new one
  • Add the new value into the first unused position in the new slice, position 1,000,000,001

The same scenario for impl7 plays out like below.

Impl7

  • Allocate a new 128 size slice
  • Set the next pointer
  • Add the value into the first position of the new slice, position 0

Impl7 never copies data around, but slice-based implementations do, and if the data set is large, it doesn't matter how fast the copying algorithm is: the copying has to be done and will take time.

The decision to use linked slices was also the result of the observation that slices go to great lengths to provide predictable, indexed positions. A hash table, for instance, absolutely needs this property, but a queue doesn't. So impl7 completely gives up this property and focuses on what really matters: add to the end, retrieve from the head. No copying or repositioning of elements is needed for that. So when a slice goes to great lengths to provide that functionality, the whole work of allocating new arrays and copying data around is wasted. None of it is necessary, and this work costs dearly for large data sets, as observed in the tests.

Impl7 Benchmark Results

Below compares impl7 with a few selected implementations.

The test names are formatted as below.

  • Benchmark/N-4: benchmarks a queue implementation, where N denotes the number of items added to and removed from the queue; 4 is the number of CPU cores on the host machine.

Examples:

  • Benchmark/0-4: benchmarks the queue by creating a new instance of it. This only tests initialization time.
  • Benchmark/100-4: benchmark the queue by creating a new instance of it and adding and removing 100 items to/from the queue.
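The naming scheme can be reproduced with Go's testing package. The sketch below uses container/list as a stand-in for the queue under test and testing.Benchmark to run each item count N, as in the results tables; the trailing -4 in the published names is GOMAXPROCS, appended by the test framework when benchmarks are run via go test.

```go
package main

import (
	"container/list"
	"fmt"
	"testing"
)

// benchQueue adds and removes n items per iteration, mirroring the
// Benchmark/N-P scheme in the results; container/list stands in for
// the queue under test.
func benchQueue(b *testing.B, n int) {
	for i := 0; i < b.N; i++ {
		q := list.New()
		for j := 0; j < n; j++ {
			q.PushBack(j)
		}
		for j := 0; j < n; j++ {
			q.Remove(q.Front())
		}
	}
}

func main() {
	for _, n := range []int{0, 1, 10, 100} {
		r := testing.Benchmark(func(b *testing.B) { benchQueue(b, n) })
		fmt.Printf("/%d: %s\n", n, r) // e.g. "/100: <iterations> <ns/op>"
	}
}
```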

Standard list used as a FIFO queue vs impl7.

benchstat rawresults/bench-list.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         34.9ns ± 1%     1.2ns ± 3%   -96.64%  (p=0.000 n=19+20)
/1-4         77.0ns ± 1%    68.3ns ± 1%   -11.21%  (p=0.000 n=20+20)
/10-4         574ns ± 0%     578ns ± 0%    +0.59%  (p=0.000 n=18+20)
/100-4       5.94µs ± 1%    3.07µs ± 0%   -48.28%  (p=0.000 n=19+18)
/1000-4      56.0µs ± 1%    25.8µs ± 1%   -53.92%  (p=0.000 n=20+20)
/10000-4      618µs ± 1%     260µs ± 1%   -57.99%  (p=0.000 n=20+18)
/100000-4    13.1ms ± 6%     3.1ms ± 3%   -76.50%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          48.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          96.0B ± 0%     48.0B ± 0%   -50.00%  (p=0.000 n=20+20)
/10-4          600B ± 0%      600B ± 0%      ~     (all equal)
/100-4       5.64kB ± 0%    3.40kB ± 0%   -39.72%  (p=0.000 n=20+20)
/1000-4      56.0kB ± 0%    25.2kB ± 0%   -55.10%  (p=0.000 n=20+20)
/10000-4      560kB ± 0%     243kB ± 0%   -56.65%  (p=0.000 n=20+20)
/100000-4    5.60MB ± 0%    2.43MB ± 0%   -56.66%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          20.0 ± 0%      15.0 ± 0%   -25.00%  (p=0.000 n=20+20)
/100-4          200 ± 0%       107 ± 0%   -46.50%  (p=0.000 n=20+20)
/1000-4       2.00k ± 0%     1.02k ± 0%   -48.95%  (p=0.000 n=20+20)
/10000-4      20.0k ± 0%     10.2k ± 0%   -49.20%  (p=0.000 n=20+20)
/100000-4      200k ± 0%      102k ± 0%   -49.22%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~29x faster (1.2ns vs 34.9ns) than the list package for init time (0 items)
  • Up to ~4x faster (3.1ms vs 13.1ms) than the list package for 100k items
  • Uses ~1/2 the memory (2.43MB vs 5.60MB) of the list package for 100k items

impl1 (simple slice-based queue implementation) vs impl7.

benchstat rawresults/bench-impl1.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         6.83ns ± 3%    1.18ns ± 3%   -82.79%  (p=0.000 n=20+20)
/1-4         48.1ns ± 6%    68.3ns ± 1%   +42.23%  (p=0.000 n=20+20)
/10-4         532ns ± 5%     578ns ± 0%    +8.55%  (p=0.000 n=20+20)
/100-4       3.19µs ± 2%    3.07µs ± 0%    -3.74%  (p=0.000 n=18+18)
/1000-4      24.5µs ± 3%    25.8µs ± 1%    +5.51%  (p=0.000 n=19+20)
/10000-4      322µs ± 4%     260µs ± 1%   -19.23%  (p=0.000 n=19+18)
/100000-4    15.8ms ±10%     3.1ms ± 3%   -80.60%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          0.00B          0.00B           ~     (all equal)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          568B ± 0%      600B ± 0%    +5.63%  (p=0.000 n=20+20)
/100-4       4.36kB ± 0%    3.40kB ± 0%   -22.02%  (p=0.000 n=20+20)
/1000-4      40.7kB ± 0%    25.2kB ± 0%   -38.25%  (p=0.000 n=20+20)
/10000-4      746kB ± 0%     243kB ± 0%   -67.47%  (p=0.000 n=20+20)
/100000-4    10.0MB ± 0%     2.4MB ± 0%   -75.84%  (p=0.000 n=15+20)

name       old allocs/op  new allocs/op  delta
/0-4           0.00           0.00           ~     (all equal)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          14.0 ± 0%      15.0 ± 0%    +7.14%  (p=0.000 n=20+20)
/100-4          108 ± 0%       107 ± 0%    -0.93%  (p=0.000 n=20+20)
/1000-4       1.01k ± 0%     1.02k ± 0%    +1.09%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.39%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~5x faster (1.18ns vs 6.83ns) than impl1 for init time (0 items)
  • Up to ~5x faster (3.1ms vs 15.8ms) than impl1 for 100k items
  • Uses ~1/4 the memory (2.4MB vs 10MB) of impl1 for 100k items

It's important to note that impl7's performance and memory gains grow dramatically with the size of the data set, because slice-based implementations don't scale well, paying a higher and higher price, in both performance and memory, every time they need to grow to accommodate an ever-expanding data set.


phf (slice, ring based FIFO queue implementation) vs impl7.

benchstat rawresults/bench-phf.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         28.1ns ± 1%     1.2ns ± 3%   -95.83%  (p=0.000 n=20+20)
/1-4         42.5ns ± 1%    68.3ns ± 1%   +60.80%  (p=0.000 n=20+20)
/10-4         681ns ± 1%     578ns ± 0%   -15.11%  (p=0.000 n=18+20)
/100-4       4.55µs ± 1%    3.07µs ± 0%   -32.45%  (p=0.000 n=19+18)
/1000-4      35.5µs ± 1%    25.8µs ± 1%   -27.32%  (p=0.000 n=18+20)
/10000-4      349µs ± 2%     260µs ± 1%   -25.67%  (p=0.000 n=20+18)
/100000-4    11.7ms ±11%     3.1ms ± 3%   -73.77%  (p=0.000 n=20+20)

name       old alloc/op   new alloc/op   delta
/0-4          16.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4          16.0B ± 0%     48.0B ± 0%  +200.00%  (p=0.000 n=20+20)
/10-4          696B ± 0%      600B ± 0%   -13.79%  (p=0.000 n=20+20)
/100-4       6.79kB ± 0%    3.40kB ± 0%   -49.94%  (p=0.000 n=20+20)
/1000-4      57.0kB ± 0%    25.2kB ± 0%   -55.86%  (p=0.000 n=20+20)
/10000-4      473kB ± 0%     243kB ± 0%   -48.68%  (p=0.000 n=20+20)
/100000-4    7.09MB ± 0%    2.43MB ± 0%   -65.77%  (p=0.000 n=18+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          15.0 ± 0%      15.0 ± 0%      ~     (all equal)
/100-4          111 ± 0%       107 ± 0%    -3.60%  (p=0.000 n=20+20)
/1000-4       1.02k ± 0%     1.02k ± 0%    +0.39%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.38%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.54%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~23x faster (1.2ns vs 28.1ns) than phf for init time (0 items)
  • Up to ~3x faster (3.1ms vs 11.7ms) than phf for 100k items
  • Uses ~1/2 the memory (2.43MB vs 7.09MB) of phf for 100k items

Buffered channel vs impl7.

benchstat rawresults/bench-channel.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         30.2ns ± 1%     1.2ns ± 3%   -96.12%  (p=0.000 n=19+20)
/1-4         87.6ns ± 1%    68.3ns ± 1%   -22.00%  (p=0.000 n=19+20)
/10-4         704ns ± 1%     578ns ± 0%   -17.90%  (p=0.000 n=20+20)
/100-4       6.78µs ± 1%    3.07µs ± 0%   -54.70%  (p=0.000 n=20+18)
/1000-4      67.3µs ± 1%    25.8µs ± 1%   -61.65%  (p=0.000 n=20+20)
/10000-4      672µs ± 1%     260µs ± 1%   -61.36%  (p=0.000 n=19+18)
/100000-4    6.76ms ± 1%    3.07ms ± 3%   -54.61%  (p=0.000 n=19+20)

name       old alloc/op   new alloc/op   delta
/0-4          96.0B ± 0%      0.0B       -100.00%  (p=0.000 n=20+20)
/1-4           112B ± 0%       48B ± 0%   -57.14%  (p=0.000 n=20+20)
/10-4          248B ± 0%      600B ± 0%  +141.94%  (p=0.000 n=20+20)
/100-4       1.69kB ± 0%    3.40kB ± 0%  +101.42%  (p=0.000 n=20+20)
/1000-4      16.2kB ± 0%    25.2kB ± 0%   +55.46%  (p=0.000 n=20+20)
/10000-4      162kB ± 0%     243kB ± 0%   +49.93%  (p=0.000 n=20+20)
/100000-4    1.60MB ± 0%    2.43MB ± 0%   +51.43%  (p=0.000 n=16+20)

name       old allocs/op  new allocs/op  delta
/0-4           1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.000 n=20+20)
/10-4          10.0 ± 0%      15.0 ± 0%   +50.00%  (p=0.000 n=20+20)
/100-4          100 ± 0%       107 ± 0%    +7.00%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.10%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.61%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.57%  (p=0.000 n=20+20)

Impl7 is:

  • Up to ~25x faster (1.2ns vs 30.2ns) than channels for init time (0 items)
  • Up to ~2x faster (3.07ms vs 6.76ms) than channels for 100k items
  • Uses ~50% MORE memory (2.43MB vs 1.60MB) than channels for 100k items

The above is not really a fair comparison, as standard buffered channels don't scale (at all) and are meant for goroutine synchronization. Nonetheless, they can make for an excellent bounded FIFO queue option. Still, impl7 is consistently faster than channels across the board, though it uses considerably more memory than channels.


Given its excellent performance under all scenarios, the hybrid approach impl7 seems to be the ideal candidate for a high performance, low memory footprint general purpose FIFO queue.

For above reasons, I propose to port impl7 to the standard library.

All raw benchmark results can be found here.

Internal Slice Size

Impl7 uses linked slices as its underlying data structure.

The size of the internal slice does influence performance and memory consumption significantly.

According to the internal slice size benchmark tests, larger internal slice sizes yield better performance and a lower memory footprint. However, the gains diminish dramatically as the slice size increases.

Below are a few interesting results from the benchmark tests.

BenchmarkMaxSubsequentSliceSize/1-4                20000         76836 ns/op       53967 B/op       2752 allocs/op
BenchmarkMaxSubsequentSliceSize/2-4                30000         59811 ns/op       40015 B/op       1880 allocs/op
BenchmarkMaxSubsequentSliceSize/4-4                30000         42925 ns/op       33039 B/op       1444 allocs/op
BenchmarkMaxSubsequentSliceSize/8-4                50000         36946 ns/op       29551 B/op       1226 allocs/op
BenchmarkMaxSubsequentSliceSize/16-4               50000         30597 ns/op       27951 B/op       1118 allocs/op
BenchmarkMaxSubsequentSliceSize/32-4               50000         28273 ns/op       27343 B/op       1064 allocs/op
BenchmarkMaxSubsequentSliceSize/64-4               50000         26969 ns/op       26895 B/op       1036 allocs/op
BenchmarkMaxSubsequentSliceSize/128-4              50000         27316 ns/op       26671 B/op       1022 allocs/op
BenchmarkMaxSubsequentSliceSize/256-4              50000         26221 ns/op       28623 B/op       1016 allocs/op
BenchmarkMaxSubsequentSliceSize/512-4              50000         25882 ns/op       28559 B/op       1012 allocs/op
BenchmarkMaxSubsequentSliceSize/1024-4             50000         25674 ns/op       28527 B/op       1010 allocs/op

Given that larger internal slices also mean potentially more unused memory in some scenarios, 128 seems to be the right balance between performance and worst-case memory footprint.

Full results can be found here.

API

Impl7 implements the API methods below.

Operation      Method
Add            func (q *Queueimpl7) Push(v interface{})
Remove         func (q *Queueimpl7) Pop() (interface{}, bool)
Size           func (q *Queueimpl7) Len() int
Return First   func (q *Queueimpl7) Front() (interface{}, bool)

As nil values are considered valid queue values, similarly to the map data structure, "Front" and "Pop" return a second bool value that indicates whether the returned value is valid, i.e. whether the queue is non-empty.
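A usage sketch of that ok idiom follows. Since the proposed container/queue package does not exist yet, `Queue` below is a minimal slice-backed stand-in written for this example; only the Push/Pop/Front/Len signatures match the table above.

```go
package main

import "fmt"

// Queue is a minimal slice-backed stand-in for the proposed package,
// used only to demonstrate the (value, bool) idiom with nil values.
type Queue struct{ items []interface{} }

func New() *Queue                   { return &Queue{} }
func (q *Queue) Push(v interface{}) { q.items = append(q.items, v) }
func (q *Queue) Len() int           { return len(q.items) }

func (q *Queue) Front() (interface{}, bool) {
	if len(q.items) == 0 {
		return nil, false
	}
	return q.items[0], true
}

func (q *Queue) Pop() (interface{}, bool) {
	v, ok := q.Front()
	if ok {
		q.items = q.items[1:]
	}
	return v, ok
}

func main() {
	q := New()
	q.Push(nil) // nil is a valid queue value...
	v, ok := q.Front()
	fmt.Println(v, ok) // <nil> true: ok distinguishes a stored nil...
	q.Pop()
	v, ok = q.Pop()
	fmt.Println(v, ok) // <nil> false: ...from an empty queue
}
```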

The reason for the above method names and signatures is the need to keep consistency with existing Go data structures such as the list, ring and heap packages.

Below are the method names used by the existing list, ring and heap Go data structures, as well as the new proposed queue.

Operation      list                 ring     heap   queue
Add            PushFront/PushBack   Link     Push   Push
Remove         Remove               Unlink   Pop    Pop
Size           Len                  Len      -      Len
Return First   Front                -        -      Front

For comparison purposes, below are the method names for C++, Java and C# for their queue implementation.

Operation      C++     Java          C#
Add            push    add/offer     Enqueue
Remove         pop     remove/poll   Dequeue
Size           size    -             Count
Return First   front   peek          Peek

Drawbacks

The biggest drawback of the proposed implementation is the potentially extra allocated but not used memory in its head and tail slices.

This scenario arises when exactly 17 items are added to the queue, causing the creation of a full-sized internal slice of 128 positions. Initially only the first element in this new slice is used to store the added value; the other 127 elements are allocated but not used.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}
// Push 1 extra item that causes the creation of a new 128 sized slice to store this value.
q.Push(17)

// Pops the first 16 items to release the first slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *127 /* head slice unused positions */))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 2040 bytes

The worst case arises when exactly 145 items are added to the queue and 143 items are removed. The queue struct then holds a 128-sized slice as its head slice with only the last element in use, plus a separate 128-sized slice as its tail slice with only the first position in use.

// Assuming a 128 internal sized slice.
q := queueimpl7.New()

// Push 16 items to fill the first dynamic slice (sized 16).
for i := 1; i <= 16; i++ {
   q.Push(i)
}

// Push an additional 128 items to fill the first full sized slice (sized 128).
for i := 1; i <= 128; i++ {
   q.Push(i)
}

// Push 1 extra item that causes the creation of a new 128 sized slice to store this value,
// adding a total of 145 items to the queue.
q.Push(1)

// Pops the first 143 items to release the first dynamic slice (sized 16) and
// 127 items from the first full sized slice (sized 128).
for i := 1; i <= 143; i++ {
   q.Pop()
}

// As unsafe.Sizeof (https://golang.org/pkg/unsafe/#Sizeof) doesn't consider the length of slices,
// we need to manually calculate the memory used by the internal slices.
var internalSliceType interface{}
fmt.Printf("%d bytes\n", unsafe.Sizeof(q)+(unsafe.Sizeof(internalSliceType) /* bytes per slice position */ *(127 /* head slice unused positions */ +127 /* tail slice unused positions */)))

// Output for a 64bit system (Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz): 4072 bytes

The code above was run on Go version "go1.11 darwin/amd64".

Open Questions/Issues

Should this be a deque (double-ended queue) implementation instead? A deque could double as a stack, but it would make more sense to offer separate queue and stack implementations (as most mainstream languages do) rather than a deque that can also be used as a stack, which is confusing. The stack is a very important computer science data structure as well, so I believe Go should offer a specialized implementation for it too (provided that implementation offers real value to users and not just a nicely named interface and methods).

Should "Pop" and "Front" return only the value instead of the value plus a second bool (which indicates whether the queue is empty)? The implication of that change is that adding nil values would no longer be valid, so "Pop" and "Front" would return nil when the queue is empty. Panicking should be avoided in library code.

The 128-sized internal slice causes, in the worst case scenario, 2040 bytes of memory to be allocated but unused (on a 64-bit system). Switching to 64 would roughly halve that memory at the cost of a slight ~2.89% performance drop (252813ns vs 260137ns). One could well argue that the extra performance gain is not worth the extra memory footprint. Should we change this value to 64, or perhaps make it configurable?

Should we also provide an implementation that is safe for concurrent use? A specialized implementation relying on atomic operations to update its internal indices and length could offer much better performance than a similar implementation relying on a mutex.

With the impending release of generics, should we wait to release the new queue package once the new generics framework is released?

Should we implement support for the range keyword for the new queue? It could be done in a generic way so other data structures could also benefit from this feature. For now, IMO, this is a topic for another proposal/discussion.

Summary

I propose to add a new package, "container/queue", to the standard library to support an in-memory, unbounded, general purpose queue implementation.

I feel strongly that this proposal should be accepted, for the following reasons.

  1. The proposed solution was well researched and probed: it is dramatically and consistently faster than 6 experimental queue implementations, 3 promising open source queue implementations, the standard list package and buffered channels, while consuming considerably less memory than every other queue implementation tested, except for buffered channels
  2. The proposed solution uses a new, unique approach to building queues, yet its implementation is clean and extremely simple. The two main methods, "Push" and "Pop", are only 16 and 19 lines of code in total, respectively. The proposed implementation also has proper tests with 100% test coverage and should require minimal maintenance going forward
  3. I'll implement any changes the Go community feels are needed for the proposed solution to be worthy of the standard library and the Go community

Update 11/26/2018.

Due to many suggestions to make the queue a deque and to deploy it as a proper external package, the deque package was built and deployed here. The proposal now is to add the deque package to the standard library instead of impl7. Refer here for details.

@gopherbot gopherbot added this to the Proposal milestone Sep 29, 2018

@gopherbot gopherbot added the Proposal label Sep 29, 2018

@cznic (Contributor) commented Sep 29, 2018

Unbounded queues don't fit well on machines with finite memory.

@christianrpetrin (Author) commented Sep 30, 2018

@cznic I get your point, but I disagree with it for a number of reasons.

  1. Modern devices, even mobile ones, come with gigabytes of memory available. Efficient data structures that allow engineers to use that much memory effectively are very welcome
  2. Go already offers a few unbounded data structures such as list, map and others. I don't understand why Go should stop now
  3. In modern cloud computing, it is fairly easy to grow memory usage by distributing the growth across more than one device. The technique I'm using on CloudLogger, for instance, is to allow unlimited memory growth in a single VM, but to configure Kubernetes to auto-scale based on memory. So when a node reaches a certain limit, say 50% of its memory in use, Kubernetes will deploy a new pod and start to round-robin the requests (which cause the memory usage to go up) between the old and new pods. This way I can effectively (and automatically, with no code changes) scale one device's memory into many others. Techniques like this are the bread and butter of modern cloud computing. An amazing language like Go should not restrict engineers from using such techniques

Maybe the term "unbounded" is a bit confusing. Maybe we should change from "unbounded queue" to "dynamically growing queue". Dynamically growing, which in the queue scenario means the same thing as unbounded, would probably be less confusing. Thoughts?

@networkimprov commented Sep 30, 2018

Agreed that the stdlib or golang.org/x repo would benefit from containers like queue and btree. It may become easier to land new containers once generics lands, tho that's a long way off yet.

Dynamic or "elastic" channel buffers have been proposed (and declined) in two forms: a channel buffer pool #20868, and unlimited channel buffer #20352. A workable alternative is one-goroutine-plus-two-channels as in this elastic channel example. It might be useful to include that in your benchmark set.

Container structures are typically unbounded/elastic, so that's unlikely to be a recurring objection here :-)

@christianrpetrin (Author) commented Sep 30, 2018

@networkimprov, thanks for posting the other channel proposals, and agreeing with the queue proposal! :-)

I looked into https://github.com/kylelemons/iq and both queue designs implemented there: a slice ring based design as well as the dynamic channel.

I have benchmarked two high quality open source, similarly slice ring based implementations: phf and gammazero.

I have also benchmarked channels as queue.

Impl7 is dramatically faster than all ring slice based implementations tested as well as channels.

Having said that, I tried to benchmark both implementations, but their designs are so "unique" that I had trouble fitting them into my benchmark tests.

I couldn't help but notice the title on the repo: "Infinite Queue implementations in Go. These are bad news, you should always use bounded queues, but they are here for educational purposes."

Blanket statements like this, based on faulty research, really don't help. I wish the author had researched other design and implementation options better before making such statements.

@cznic (Contributor) commented Sep 30, 2018

Blanket statements like this that are based on faulty research really doesn't help.

The above is a blanket statement because it claims some undefined research is faulty without providing any evidence ;-)

@phf (Contributor) commented Sep 30, 2018

Thanks for noticing my queue. I obviously agree that a queue should be in the standard library. Personally I believe that a deque is what you want in the standard library. It's usually not much more work to implement (or much more complex code-wise) but it's twice as useful and not (in principle) harder to make fast. The proposal sounds good and I wish it success. Oh, and congrats on beating my queue! :-)

@wsc1 commented Sep 30, 2018

Nice queue!

However, my feeling is that the state of affairs of containers and the standard library calls for a global assessment of what the standard library should provide for containers before considering inclusion based on the merits of a single implementation of a single data structure.

@networkimprov commented Oct 1, 2018

Re benchmarking kylelemons/iq, its two channels simulate a single standard channel (one is input, the other output). But you already knew that, so the issue must be the goroutine?

Re its claim, "you should always use bounded queues," that means channel queues. As channels are principally an "IGrC" mechanism, if the consumer goroutine stops, the producer/s must block on c <- x (iow feel "backpressure").

I gather the API isn't thread safe? Have you benchmarked a concurrent scenario, e.g. multiple pushers & single popper? That's where a comparison with kylelemons/iq is most relevant. (And that's how I'd use your queue.)

Related: @ianlancetaylor once voiced support for the idea of allocating a bounded channel buffer as-needed, and your list-of-slices technique might apply there. There's not a separate proposal for that as yet.

@egonelbre (Contributor) commented Oct 1, 2018

I remember https://github.com/karalabe/cookiejar/tree/master/collections having a decent implementation.

A good concurrent queue needs several things, e.g. lock-freedom on the happy path and non-spinning when blocking. For different concurrent situations (MPMC, SPMC, MPSC, SPSC) and high performance, the code would also look different, since not having to manage concurrency at a particular producer/consumer helps to speed things up.

As for tests, you should also cover the slowly increasing cases (i.e. 2 x Push, 1 x Pop or 3 x Push, 2 x Pop) and the slowly decreasing cases (i.e. fill the queue, then 1 x Push, 2 x Pop or 2 x Push, 3 x Pop). This document also does not include the stable cases (fill 1e6, or larger depending on the internal block size, then repeat 1 x Push, 2 x Pop N times), and it's missing large cases with 10e6 items. And the average might not be the best measurement for these; look at percentiles too.

@bcmills (Member) commented Oct 1, 2018

Should this be a deque (double-ended queue) implementation instead?

I believe that the two-array deque is conventional in language libraries. It tends to provide better cache behavior than a linked-list approach, and a lower cost (particularly in the face of mixed push-front/push-back usage) than a single-array queue.

@bcmills (Member) commented Oct 1, 2018

Good concurrent queue will need several things i.e. lock-freedom in the happy bath, non-spinning when blocking.

“Lock-freedom” is not particularly meaningful for programs that run on multiple cores. The CPU's cache coherence protocol has many of the same properties as a lock.

@bcmills (Member) commented Oct 1, 2018

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

@bcmills (Member) commented Oct 1, 2018

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@uluyol (Contributor) commented Oct 1, 2018

I agree that when used for concurrency, unbounded queues are a bad fit. However, unbounded queues are useful for sequential work e.g. when doing a breadth first search. We use slices that can grow arbitrarily, maps that can grow arbitrarily, and stacks that can grow arbitrarily, why can't the programmer be trusted to use them when appropriate?

I don't understand what the push back is on the idea (though I agree that it should wait on generics). An unbounded queue is not an unbounded channel.

@christianrpetrin (Author) commented Oct 1, 2018

Thanks @egonelbre for pointing out the https://github.com/karalabe/cookiejar/blob/master/collections/queue/queue.go queue implementation. This is an interesting design based on a dynamically growing circular slice of blocks.

I have added this queue to the bench tests. Feel free to clone the repo and run the tests to validate below results by yourself.

benchstat rawresults/bench-cookiejar.txt rawresults/bench-impl7.txt
name       old time/op    new time/op    delta
/0-4         9.47µs ± 0%    0.00µs ± 3%   -99.99%  (p=0.000 n=19+20)
/1-4         9.52µs ± 1%    0.07µs ± 1%   -99.28%  (p=0.000 n=18+20)
/10-4        9.76µs ± 1%    0.58µs ± 0%   -94.08%  (p=0.000 n=19+20)
/100-4       11.6µs ± 1%     3.1µs ± 0%   -73.54%  (p=0.000 n=20+18)
/1000-4      29.6µs ± 2%    25.8µs ± 1%   -12.71%  (p=0.000 n=20+20)
/10000-4      231µs ± 1%     260µs ± 1%   +12.56%  (p=0.000 n=19+18)
/100000-4    3.21ms ±13%    3.07ms ± 3%    -4.41%  (p=0.000 n=18+20)

name       old alloc/op   new alloc/op   delta
/0-4         65.6kB ± 0%     0.0kB       -100.00%  (p=0.000 n=20+20)
/1-4         65.6kB ± 0%     0.0kB ± 0%   -99.93%  (p=0.000 n=20+20)
/10-4        65.6kB ± 0%     0.6kB ± 0%   -99.09%  (p=0.000 n=20+20)
/100-4       66.4kB ± 0%     3.4kB ± 0%   -94.88%  (p=0.000 n=20+20)
/1000-4      73.6kB ± 0%    25.2kB ± 0%   -65.80%  (p=0.000 n=20+20)
/10000-4      277kB ± 0%     243kB ± 0%   -12.29%  (p=0.000 n=20+20)
/100000-4    2.45MB ± 0%    2.43MB ± 0%    -0.79%  (p=0.000 n=20+20)

name       old allocs/op  new allocs/op  delta
/0-4           2.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
/1-4           2.00 ± 0%      2.00 ± 0%      ~     (all equal)
/10-4          11.0 ± 0%      15.0 ± 0%   +36.36%  (p=0.000 n=20+20)
/100-4          101 ± 0%       107 ± 0%    +5.94%  (p=0.000 n=20+20)
/1000-4       1.00k ± 0%     1.02k ± 0%    +2.00%  (p=0.000 n=20+20)
/10000-4      10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=20+20)
/100000-4      100k ± 0%      102k ± 0%    +1.52%  (p=0.000 n=20+20)

This queue does have good performance for fairly large data sets, but its performance for small data sets, both speed and memory wise, is not very good.

This is a very good implementation, and I really like it, but:

  1. Due to its poor performance with small data sets, I would not regard this as a well balanced, general purpose queue
  2. I have concerns regarding memory leaks in this queue. I don't see anywhere in the code where the just-released position in the internal slice gets set to nil
  3. This queue suffers from the same problem as every other pure slice based queue: copying of existing elements when the queue needs to grow
@christianrpetrin (Author) commented Oct 1, 2018

@networkimprov kylelemons/iq is a very unique design which, it seems to me, was built to prove the point that channels do not make good unbounded queues.

I can't agree more with this. Channels are meant for goroutine synchronization, not to serve as a sequential data store the way a traditional queue can, which, by the way, is a core use case for queues (if you can't process fast enough, store). That's why I was actually adamant about including buffered channels in the tests: they are not meant to be used as queues. If people really want to use them as queues, then a separate proposal with a separate discussion should follow. This is not a "dynamic channel" proposal. Please keep that in mind.

Regarding the claim "you should always use bounded queues", I understood (by reading the code) that the author meant channels. My point was that the statement is not clear and bounded enough, so people reading it might think that no unbounded queue, regardless of its design, implementation, performance, etc., is good. I can't agree with that.

Impl7 is not safe for concurrent use. As the map and all stdlib containers are not safe for concurrent use, the new queue should follow the same pattern. Not all users will need a concurrency-safe queue, so IMO a safe for concurrent use version is a topic for another proposal/discussion, just like we have a separate sync.Map for a concurrency-safe map implementation.

I do talk a bit about the subject on the design document though.

@christianrpetrin (Author) commented Oct 1, 2018

As @cznic notes, unbounded queues are not usually what you want in a well-behaved program anyway.

For example, in a Logger implementation, you don't want to consume arbitrary memory and crash (and lose your pending logs) if one of your logging backends can't keep up. If the logs are important, it's better to apply flow-control and delay the program until they can be sent. If the logs are not important, it's better to drop some logs when the buffer gets full, rather than dropping the entire program state.

That's sort of the plan. Take a look at this response. I can't apply flow control all the way to the clients. That would make the hosting system slower, affecting its clients (the human beings). Non-critical infrastructure such as logging should not significantly affect the hosting system. This is a problem that has no solution other than to either queue or discard the logs, and discarding should be the very last approach, used only when queueing is no longer possible. FWIW, CloudLogger already treats important logs very differently from regular logs, so even if one of the backend servers dies and a whole bunch of logs gets lost, that's by design. The important logs are not long-term cached and will not be lost (in large numbers, at least).

Also, I get your concerns about using too much memory, but think of this queue as a data structure that is also highly efficient at dealing with large amounts of data, gigabytes of it. It has to be unbounded because each device has a certain amount of memory available at any given time. It should be up to the engineer using the queue to decide how much memory he/she wants to use. It's the job of the language designers to provide tools that allow engineers to handle their needs with the maximum possible efficiency, not to dictate how much memory their programs should use.

If we can provide all Go engineers a data structure flexible enough that they know they will get really good performance no matter their needs, I think that data structure brings real value. Go engineers have been building their own queues for a long time now, often with [pretty bad results](https://github.com/christianrpetrin/queue-tests/blob/master/bench_queue.md). That's also my point here. Go engineers are already doing "the bad thing" of using a lot of memory. The proposed solution just allows them to keep doing that in a much more efficient way.

Again, I'm not proposing a super crazy, specialized data structure here. Coming from Java and C#, I was super surprised that Go doesn't offer a traditional queue implementation. Having a traditional queue implementation, especially a super flexible and fast one, also helps bring all those Java, C#, etc. devs to the Go world.

@christianrpetrin (Author) commented Oct 1, 2018

Given the number and variety of queue implementations that already exist, and their variation on properties such as push-front support and concurrency-safety, I would prefer that we not add a queue to the standard library — especially not before we figure out what's going on with generics (#15292).

@bcmills please consider below.

  1. The queue proposed here is dramatically faster than every other queue tested. If you know of another queue implementation you feel might be a better design/implementation than the one proposed here, please let me know and I'll be extremely happy to validate it. If we find a faster, balanced queue implementation, I'm more than happy to root for that implementation to make it into the stdlib. As a matter of fact, I'd definitely use it in all my projects
  2. @networkimprov mentioned generics are a long way off. Should we hold off improvements until we finally decide what we want to do with generics? How long have we been talking about generics? How much longer will it take? IMO we should keep an agile/SCRUM-like approach to Go development where small, incremental improvements are constantly built and released. That's one of the reasons Go is beating every other mainstream language out there. We should allow and foster organic growth and constant small improvements to the language. High level discussions do have their place and should exist, but they can take place in parallel and should not hold off development for very long.
  3. I'm very happy to implement any changes the Go community feels are needed to make the proposed queue better. All suggestions and criticism are most welcome. Especially to the people who are thumbing down my proposal: I'm not seeing many comments on why the proposal is not good. Please let us know why you disagree, and reply to my comments, so we can have a constructive conversation.
@cznic (Contributor) commented Oct 1, 2018

Slices can be used as queues, but often they're not. Sometimes a program collects items and sorts them, for example. A bounded slice is not helpful in that case. If the sorting requires more memory than is available, there's no other option, to a first approximation, than to crash.

Maps serve a different purpose, but the argument about bound/unbound is the same.

There are many other general data structures with different big-O characteristics of their particular operations. One usually carefully selects the proper one and the argument about bound/unbound is still the same.

Channels typically serve as a queue between concurrently executing producers and consumers. They have a limit that blocks producers when reached and provides useful back pressure. Making a channel unlimited loses that property.

A non-channel queue (FIFO) offers a rather limited set of operations: Put and Get (or whatever the names were chosen). If it is to be used in the scenario where a slice, map or other generic data structure can be used, then there's no real need to have or use it.

The remaining scenarios for using a FIFO are IMO mostly - if not only - where the producers and consumers execute concurrently. If a FIFO is used effectively as a channel there are two important questions to answer:

  • Why not use a channel in the first place?
  • Why is the back pressure property useful for channels but not useful for a FIFO used as a channel?

Also note that the slices and maps that exist today in the language are type generic. The proposed unbounded queue seems to use interface{} typed items. That adds memory overhead and hurts cache locality. I haven't really studied the code, but in a proper benchmark, []interface{} cannot beat []int in either of those two aspects. Similarly for []interface{} vs a buffered chan int.

@bcmills (Member) commented Oct 1, 2018

If you known of another queue implementation you feel like might be a better design/implementation than the proposed here, please let me know and I'm extremely happy to validate it.

See https://golang.org/cl/94137 (which I ought to dust off and submit one of these days).

@christianrpetrin (Author) commented Oct 1, 2018

Also, to help validate my point about the whole bunch of problematic Go queues out there, I was able to confirm that the cookiejar queue actually has a memory leak. I just ran some bench tests locally with large numbers of items added.

go test -benchmem -count 1 -benchtime 60s -timeout 60m -bench=".*Cookiejar" -run=^$
goos: darwin
goarch: amd64
pkg: github.com/christianrpetrin/queue-tests
BenchmarkCookiejar/1000000-4         	    2000	  44099280 ns/op	24830036 B/op	 1000489 allocs/op
BenchmarkCookiejar/10000000-4        	     200	 464993059 ns/op	317408985 B/op	10004883 allocs/op
BenchmarkCookiejar/100000000-4       	      10	7453667009 ns/op	9649322720 B/op	100048829 allocs/op
^Csignal: interrupt
FAIL	github.com/christianrpetrin/queue-tests	1510.242s

The memory growth is not linear in the amount of work. Also, the last test (1 billion items) took so long (5+ minutes) that I had to abort it. Impl7 doesn't suffer from these problems. I would most certainly not recommend the cookiejar queue implementation until these issues are fixed.

@ianlancetaylor (Contributor) commented Oct 1, 2018

This is a worthy proposal, but, since generics are an active topic of discussion (https://go.googlesource.com/proposal/+/master/design/go2draft.md), I agree with @bcmills that we should not pursue this further until that discussion has settled. It will inevitably affect the API, and possibly the implementation, of any new container types in the standard library.

@ianlancetaylor (Contributor) commented Oct 1, 2018

Let me add that a queue implementation is just as good outside the standard library as in. There is no rush to adding queues to the standard library; people can use them effectively as an external package.

@networkimprov commented Oct 1, 2018

Specially the people who is thumbing down my proposal, I'm not seeing many comments why the proposal is not good.

Go philosophy takes a "when in doubt, leave it out" stance, so some ppl thumb-down stuff they hardly/never use. For example, notice the thumbs on this proposal to drop complex number support: #19921. From what I've read, the maintainers don't make decisions based on thumb-votes.

@christianrpetrin (Author) commented Oct 1, 2018

@ianlancetaylor I agree the generics discussion should move forward. My only concern is how much longer it will take until a final decision is made and deployed. There's a clear need for a general purpose queue in the Go stdlib; otherwise we wouldn't see so many open source queue implementations out there. Can we get some sort of timeline for when the decision on generics will be made?

Also, I agree a queue could be deployed to any open source repo, but please read the background section of my proposal to understand why I think there's a serious need for a high-performance, issue-free queue in the Go stdlib. To highlight the point:

Using external packages has to be done with great care. Packages in the stdlib, however, are supposed to be really well tested and performant, as they are validated by the whole Go community. Deciding to use a stdlib package is easy, but deciding to use an open source, not widely disseminated implementation is much more complicated. The other big problem is: how do people find out which are the best, safe to use queue implementations? As pointed out by @egonelbre, this one looks like a really interesting queue, but it has hidden memory leaks. Most Go engineers don't have the time to validate and properly probe these queues. That's the biggest value we can provide to the whole community: a safe, performant and very easy to use queue.

@networkimprov commented Oct 1, 2018

@ianlancetaylor how about landing a queue in golang.org/x in the interim?

@ianlancetaylor (Contributor) commented Oct 1, 2018

There is no specific timeline for generics support.

The argument about what should be in the standard library is a large one that should not take place on this issue. I'll just mention the FAQ: https://golang.org/doc/faq#x_in_std .

@Merovius commented Oct 2, 2018

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

@bcmills (Member) commented Oct 2, 2018

the results are actually (I'm a bit surprised) very similar for all tests, except the last one. Pay attention to that one. That's the 1bi items test. The slice based takes ~3x as much (45s vs 145s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Again, I think you are focusing on too narrow a set of benchmarks. You have a solution that performs well for very large queues that are filled and then drained in one long burst, under the assumption that peak memory usage for very large queues is what matters.

It is not at all obvious to me that very large queues filled and drained once are the dominant use-case for queues in general, or that the memory usage of the queue itself is significant (compared to the memory transitively pointed to by the queue's contents). I would want to see evidence to support both of those assumptions before adding such a specifically-optimized implementation to the standard library.

(In contrast, compare sync.Map: it is also specifically optimized for certain modes of use, but those modes of use appear repeatedly within the standard library. If we didn't have standard-library call sites that needed it, sync.Map would be more appropriate as a standalone package.)

@christianrpetrin (Author) commented Oct 2, 2018

but if you have read the proposal, you'd see why I believe a truly unbounded queue, an efficient one that is capable of handling large amounts of data is highly desirable.

I have read the proposal. I disagree with your reasoning. But that's not what I asked. I asked "what are realistic N that you care about?". i.e. assuming an unbounded queue is sensible - how large will it roughly be in practice? As we've established, the answer to that question determines what the best implementation is - an implementation that is better for large N will be worse for small N. So "as big as possible" just isn't a useful answer, because you end up worsening the performance for actual use-cases.

Really, you find this code complex?

Compared to append/q[1:]? Yeah.

  1. Quoting the proposal: "As there's no telling how much data will have to be queued as it depends on the current traffic, an unbounded, dynamically growing queue is the ideal data structure to be used.".
    This means the memory used is linear to the amount of requests (i.e. active users) the hosting application is handling. Of course, even if techniques such as this are
    used to distribute the memory usage among many devices, under a DDoS attack for instance, the incurred cost of auto scaling to a large number of nodes in order to hold enormous amounts of data (terabytes? petabytes?) is prohibitive. So what CloudLogger will do is state this problem very clearly and recommend that users specify a hard limit on how much memory they want each CloudLogger backend server to hold, as well as how many extra nodes to auto scale to. Once those user-defined, system-specific limits are set, CloudLogger will diligently cache the rate-limited logs up to the specified limits. After that it starts to discard the logs.

  2. I think the point you are missing here is that impl7 will perform really well no matter how much memory it has to use because, again, of its design, not the implementation. So in a way, impl7 is future proof. If, a few years from now, we have computers with exabytes of memory and some "crazy" engineers decide to build systems that use that much memory, impl7 will provide the efficiency those engineers need to handle that much data.

  3. I think what you are trying to say is that this code is slightly more complicated than this one. But in no way, shape or form would I call this code complex. This reads to me as incredibly simple code that does an amazing job for what it is.
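A minimal sketch of the bounded, discard-when-full policy described in point 1 (illustrative only; the type and names are made up, not CloudLogger's actual code):

```go
package main

import "fmt"

// boundedQueue sketches the "cache up to a user-defined limit, then
// discard" policy described above. Illustrative only; the type and
// field names are hypothetical, not CloudLogger's actual code.
type boundedQueue struct {
	items  []string
	maxLen int
}

// push enqueues v, or reports false (discarding v) once the
// user-defined limit is reached.
func (q *boundedQueue) push(v string) bool {
	if len(q.items) >= q.maxLen {
		return false // at the limit: discard the log entry
	}
	q.items = append(q.items, v)
	return true
}

func main() {
	q := &boundedQueue{maxLen: 2}
	fmt.Println(q.push("log1"), q.push("log2"), q.push("log3")) // true true false
}
```

The same policy could equally drop the oldest entry instead of the newest; which side to drop is an application decision.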

@networkimprov


commented Oct 2, 2018

It might make sense to resume this discussion after generics lands...

@christianrpetrin

Author

commented Oct 2, 2018

It might make sense to resume this discussion after generics lands...

Ideally yes. I completely agree that if we had generics in place, we'd be in a better place now. My only concern is how much longer generics will take to land. IMO, if it's going to take a couple of years from now, I'd like to see this proposal make it into the stdlib before that. People will be using it for years before we need to make any changes to it, which, btw, I expect would be simple changes anyway.

But if you guys feel like we're really close and should be able to land generics, in say, a couple of months from now, I'm completely happy to wait.

@networkimprov


commented Oct 2, 2018

#27935 (comment)

The Go gods are very deliberate.

@Merovius


commented Oct 2, 2018

I think what you are trying to say is that this code is slightly more complicated than this one.

No, that's not what I am trying to say. I'm saying that both of them are unnecessarily complicated. Just declare a queue as var q []T, use q = append(q, v) to push to the queue, v, q = q[0], q[1:] to pop from it and len(q) to get its length. It's a correct, easy to understand, performant implementation of a FIFO queue.

Obviously I won't convince you of that. I still think that your benchmarks should at least include that approach to give a fair and complete picture of whether the cost of this abstraction are worth it, though.
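The approach described above can be sketched as follows (a minimal, non-concurrent example; one known caveat is that q[1:] keeps the popped element reachable in the backing array until it is reallocated, which matters for pointer-typed elements):

```go
package main

import "fmt"

// push and pop implement the naive slice-based FIFO described above.
func push(q []int, v int) []int { return append(q, v) }

// pop assumes the queue is non-empty; check len(q) first.
func pop(q []int) (int, []int) { return q[0], q[1:] }

func main() {
	var q []int
	q = push(q, 1)
	q = push(q, 2)
	q = push(q, 3)
	v, q := pop(q)
	fmt.Println(v, len(q)) // 1 2: first-in element, two remaining
}
```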

@christianrpetrin

Author

commented Oct 2, 2018

The Go gods are very deliberate.

I get the point, but I think it's imperative we have some sort of timeline on generics. Who can guarantee generics will land in the next, say, two years? What if there are so many proposals and disagreements that the Go gods have trouble agreeing on one? I understand this is a super hard process. There's so much involved, and so many different use cases that need to be addressed in an efficient way. It's a really hard job, and I'm afraid that if we don't set a hard timeline on the decision, a couple of years from now we could still be discussing generics. That would greatly frustrate anyone who is willing to contribute because, well, their contributions will not be accepted.

@christianrpetrin

Author

commented Oct 2, 2018

As for a faster version for larger numbers of items, here's a quick and dirty one:

BenchmarkImpl7/0-8     5000000000         1.29 ns/op       0 B/op       0 allocs/op
BenchmarkEgon/0-8      5000000000         1.93 ns/op       0 B/op       0 allocs/op
BenchmarkImpl7/1-8       50000000       108 ns/op         48 B/op       2 allocs/op
BenchmarkEgon/1-8         1000000      6626 ns/op      18432 B/op       1 allocs/op
BenchmarkImpl7/10-8       5000000       892 ns/op        600 B/op      15 allocs/op
BenchmarkEgon/10-8        1000000      6882 ns/op      18504 B/op      10 allocs/op
BenchmarkImpl7/100-8      1000000      4708 ns/op       3400 B/op     107 allocs/op
BenchmarkEgon/100-8        500000      8936 ns/op      19224 B/op     100 allocs/op
---
BenchmarkImpl7/1000-8      100000     39952 ns/op      25160 B/op    1021 allocs/op
BenchmarkEgon/1000-8       200000     30922 ns/op      26424 B/op    1000 allocs/op
BenchmarkImpl7/10000-8      10000    409220 ns/op     242760 B/op   10161 allocs/op
BenchmarkEgon/10000-8       10000    330812 ns/op     264312 B/op   10009 allocs/op
BenchmarkImpl7/100000-8      1000   4358965 ns/op    2427087 B/op  101569 allocs/op
BenchmarkEgon/100000-8       1000   3937140 ns/op    2606335 B/op  100097 allocs/op

For counts <1000 it's slower because it's mostly measuring construction time rather than actual Pop/Push time. For repeated fills 1000x you should be able to see these results:

BenchmarkImpl7/0-8       3000000          1291 ns/op            0 B/op           0 allocs/op
BenchmarkEgon/0-8        2000000          1921 ns/op            0 B/op           0 allocs/op
BenchmarkImpl7/1-8         50000        108040 ns/op        48000 B/op        2000 allocs/op
BenchmarkEgon/1-8         300000         14591 ns/op        18432 B/op           1 allocs/op
BenchmarkImpl7/10-8         5000        881587 ns/op       600001 B/op       15000 allocs/op
BenchmarkEgon/10-8         20000        278252 ns/op       256320 B/op        9010 allocs/op
BenchmarkImpl7/100-8        1000       4685733 ns/op      3400007 B/op      107000 allocs/op
BenchmarkEgon/100-8         2000       2983277 ns/op      2598344 B/op       99098 allocs/op
BenchmarkImpl7/1000-8        100      40053001 ns/op     25160064 B/op     1021000 allocs/op
BenchmarkEgon/1000-8         200      28794169 ns/op     26000129 B/op      999977 allocs/op
BenchmarkImpl7/10000-8        10     407284450 ns/op    242760470 B/op    10161000 allocs/op
BenchmarkEgon/10000-8         20     299871275 ns/op    259999664 B/op    10008766 allocs/op
BenchmarkImpl7/100000-8        1    4402977600 ns/op   2427088160 B/op   101569000 allocs/op
BenchmarkEgon/100000-8         2    2822037900 ns/op   2600012728 B/op   100096660 allocs/op

I don't think my implementation is particularly good; I suspect there are a bunch of ways to improve it, e.g. free blocks for stable queue sizes. Of course, this would slow down the simpler cases.

Note I changed the test such that it uses Pop until false, instead of checking Len every time. The quick & dirty version has a slower Len to make Push and Pop faster.

@egonelbre fantastic insight! I really appreciate your time looking into this!

Below are my thoughts.

  1. Wonderful insight on testing head == nil instead of len for an empty queue. I actually changed impl7 to use your recommendation; initial tests seem to show a small performance improvement for large data sets. For small ones, there's a small slowdown. Not sure yet if this is significant.
benchstat bench1.txt bench2.txt
name            old time/op    new time/op    delta
Impl7/0-4         1.45ns ± 1%    1.48ns ± 3%  +2.29%  (p=0.000 n=20+17)
Impl7/1-4         68.6ns ± 2%    70.1ns ± 3%  +2.13%  (p=0.000 n=18+20)
Impl7/10-4         582ns ± 4%     593ns ± 5%  +1.92%  (p=0.000 n=19+18)
Impl7/100-4       3.15µs ± 5%    3.18µs ± 8%    ~     (p=0.280 n=20+19)
Impl7/1000-4      26.1µs ± 4%    26.5µs ± 7%    ~     (p=0.104 n=18+18)
Impl7/10000-4      268µs ± 3%     274µs ± 9%    ~     (p=0.496 n=19+20)
Impl7/100000-4    3.29ms ± 7%    3.17ms ± 7%  -3.53%  (p=0.009 n=16+20)

name            old alloc/op   new alloc/op   delta
Impl7/0-4          0.00B          0.00B         ~     (all equal)
Impl7/1-4          48.0B ± 0%     48.0B ± 0%    ~     (all equal)
Impl7/10-4          600B ± 0%      600B ± 0%    ~     (all equal)
Impl7/100-4       3.40kB ± 0%    3.40kB ± 0%    ~     (all equal)
Impl7/1000-4      25.2kB ± 0%    25.2kB ± 0%    ~     (all equal)
Impl7/10000-4      243kB ± 0%     243kB ± 0%    ~     (all equal)
Impl7/100000-4    2.43MB ± 0%    2.43MB ± 0%    ~     (p=0.751 n=20+20)

name            old allocs/op  new allocs/op  delta
Impl7/0-4           0.00           0.00         ~     (all equal)
Impl7/1-4           2.00 ± 0%      2.00 ± 0%    ~     (all equal)
Impl7/10-4          15.0 ± 0%      15.0 ± 0%    ~     (all equal)
Impl7/100-4          107 ± 0%       107 ± 0%    ~     (all equal)
Impl7/1000-4       1.02k ± 0%     1.02k ± 0%    ~     (all equal)
Impl7/10000-4      10.2k ± 0%     10.2k ± 0%    ~     (all equal)
Impl7/100000-4      102k ± 0%      102k ± 0%    ~     (all equal)

For the fun part, why would checking for head instead of len be faster?

I would assume using head is faster because of the high rate of reads and writes to the len variable: its value constantly has to be synced from the CPU registers and hardware caches (https://en.wikipedia.org/wiki/Cache_(computing)#Examples_of_hardware_caches) back to main memory, which causes some slowdown. Head suffers from the same problem, but because its value changes only once every 129 items removed, the rate at which its value is synced back to main memory is n/129, where 129 is the number of items that must be removed from the queue before head is updated (to point to the queue's next slice). The write rate for the len variable, by contrast, is linear in the number of items removed from the queue.

Still, your suggested change has a lot of theoretical potential to improve performance in some high load scenarios, so I checked it in already, but I'll keep testing to try to better understand its implications.

  2. Can you please share the test code you used for the "repeated fills 1000x" test?

  3. I see some strange memory patterns with your solution. Impl7's memory usage grows almost perfectly linearly with the amount of work, no matter the load. I think this might be happening because your Pop doesn't set the next pointer to nil after a slice has been used up. Take a look at line 137 of impl7 here. Not setting the next pointer to nil before "releasing" the slice (losing the reference to it) leaves it pointing to the new head slice, which I suspect makes the GC think that memory is still in use. I suspect if you fix your code to set next to nil, you'll see a performance drop in your benchmarks.

  4. I see you are using a fairly large internal slice size (1024). Based on the slice size tests, I was also able to validate that the larger the internal slice size, the better the results for large data sets. The problem with larger slice sizes is the potential for larger amounts of allocated but unused memory (see "Drawbacks" in the proposal). So 128 is a compromise between the best performance for very large data sets and wasted memory consumption.
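For readers following along, the chunked design under discussion (a linked list of small fixed-size slices, with a read index into the head chunk and a write index into the tail chunk) can be sketched roughly as follows. This is an illustration of the idea only, not impl7's actual code:

```go
package main

import "fmt"

const chunkSize = 128 // impl7 is described as using 128-item internal slices

// node is one fixed-size chunk in a singly linked list of slices.
type node struct {
	v    [chunkSize]interface{}
	next *node
}

// Queue is a simplified sketch of the chunked design: values are
// written into the tail chunk and read from the head chunk.
type Queue struct {
	head, tail *node
	hi, ti     int // hi: next read slot in head; ti: next write slot in tail
}

func New() *Queue {
	n := &node{}
	return &Queue{head: n, tail: n}
}

// Push appends v, linking in a fresh chunk when the tail is full.
func (q *Queue) Push(v interface{}) {
	if q.ti == chunkSize {
		n := &node{}
		q.tail.next = n
		q.tail = n
		q.ti = 0
	}
	q.tail.v[q.ti] = v
	q.ti++
}

// Pop removes and returns the oldest value, reporting false when empty.
func (q *Queue) Pop() (interface{}, bool) {
	if q.hi == chunkSize && q.head.next != nil {
		// Head chunk fully drained: unlink it so the GC can reclaim it.
		n := q.head.next
		q.head.next = nil
		q.head = n
		q.hi = 0
	}
	if q.head == q.tail && q.hi == q.ti {
		return nil, false
	}
	v := q.head.v[q.hi]
	q.head.v[q.hi] = nil // release the reference, per the discussion above
	q.hi++
	return v, true
}

func main() {
	q := New()
	for i := 0; i < 300; i++ { // spans three chunks
		q.Push(i)
	}
	v, _ := q.Pop()
	fmt.Println(v) // oldest item first
}
```

Note how setting both the drained chunk's next pointer and each consumed slot to nil lets the collector reclaim memory as the queue shrinks, which is the point being discussed above.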

@christianrpetrin

Author

commented Oct 2, 2018

the results are actually (I'm a bit surprised) very similar for all tests, except the last one. Pay attention to that one. That's the 1 billion items test. The slice based takes ~3x as much (45s vs 145s) and uses ~3x more memory (54.0GB vs 16.3GB; true across the board for 10k+ items).

Again, I think you are focusing on too narrow a set of benchmarks. You have a solution that performs well for very large queues that are filled and then drained in one long burst, under the assumption that peak memory usage for very large queues is what matters.

It is not at all obvious to me that very large queues filled and drained once are the dominant use-case for queues in general, or that the memory usage of the queue itself is significant (compared to the memory transitively pointed to by the queue's contents). I would want to see evidence to support both of those assumptions before adding such a specifically-optimized implementation to the standard library.

(In contrast, compare sync.Map: it is also specifically optimized for certain modes of use, but those modes of use appear repeatedly within the standard library. If we didn't have standard-library call sites that needed it, sync.Map would be more appropriate as a standalone package.)

@bcmills, it makes a lot of sense to me! I'm also highly interested in finding out any scenarios where impl7 would not perform well, so I'll wholeheartedly implement any test suggestions.

As of right now I have a feeling there's room for improvements on repeated refills, but we'll see.

What about the tests below? Would they be enough to help you understand the queue's performance?

Test 1: On each test iteration:

  • Pick a random number between 1 and 10k
  • Add that number of items to the queue
  • Remove all the items from the queue
  • Repeat 10 times, reusing the same queue instance

Test 2: On each test iteration:

  • Pick a random number between 1 and 10k
  • Add that number of items to the queue
  • Pick another random number between 1 and the initial random number
  • Remove that number of items from the queue
  • Repeat 10 times, reusing the same queue instance
@bcmills

Member

commented Oct 2, 2018

What about the tests below? Would they be enough to help you understand the queue's performance?

No, I don't think that really helps. Tests of random inputs are not necessarily representative of real programs.

If you're talking about a general-purpose queue library, the best way to gauge its performance impact is to find some real programs that use queues in the wild, and benchmark them with various queue implementations (including typed slices, []interface{}, your chunked-list implementation, a two-array deque, and probably a naive linked-list using container/list) substituted in for whatever they're using today.

If the profile numbers don't change with the implementation, then you may as well use whatever is simplest (which, today, is usually the slice approach that @Merovius described). If they do change with the implementation, then you'll be able to see which implementation works best for which kinds of programs.
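One way to set up such substitution is a small common interface that each candidate implements; the names below are illustrative, not a proposed API:

```go
package main

import "fmt"

// Queue is a sketch of the kind of common interface that would let a
// real program swap queue implementations in and out for profiling,
// as suggested above. The method names are hypothetical.
type Queue interface {
	Push(v interface{})
	Pop() (interface{}, bool)
	Len() int
}

// sliceQueue adapts the naive slice approach to that interface; a
// chunked-list, two-array deque, or container/list implementation
// would satisfy it the same way.
type sliceQueue struct{ items []interface{} }

func (q *sliceQueue) Push(v interface{}) { q.items = append(q.items, v) }

func (q *sliceQueue) Pop() (interface{}, bool) {
	if len(q.items) == 0 {
		return nil, false
	}
	v := q.items[0]
	q.items = q.items[1:]
	return v, true
}

func (q *sliceQueue) Len() int { return len(q.items) }

func main() {
	var q Queue = &sliceQueue{} // swap in any implementation here
	q.Push("a")
	q.Push("b")
	v, _ := q.Pop()
	fmt.Println(v, q.Len()) // a 1
}
```

Note the interface itself adds an indirection cost, which is part of why profiling the real program, not the queue in isolation, matters.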

@egonelbre

Contributor

commented Oct 2, 2018

  1. The nil check is not that important. The small win comes from avoiding len altogether.

  2. I'm using a different computer, so I quickly recreated it: https://gist.github.com/egonelbre/bffa9e02282f6623fc1b331a9515feee. I also added two additional variations where CookieJar might do well. Tracking a single free block should suffice for stable queue sizes. But as @bcmills said, you need real-world benchmarks, not just synthetic ones.

  3. Will review tomorrow.

  4. Hence the reason I called it "quick & dirty"; I didn't bother tweaking the parameters or the design for optimality.

Although, I probably won't dwell on this issue too long afterwards.

@christianrpetrin

Author

commented Oct 2, 2018

@bcmills @egonelbre

Here's how I believe such task to gather real world data for queues should be carried out.

  1. Find a few, preferably at least three, real world, production Go systems that use queues in some shape or form. All systems should use the queue for different purposes/use cases.
  2. For each one of the systems
    2.1) Update it to expose metrics about queue usage, such as the number of items added and removed, the ratio of adds to removals, the minimum and maximum number of items in the queue, and usage patterns, among other possibly useful metrics
    2.2) Update it to use typed slices as a queue
    2.3) Deploy to production
    2.4) Let it run with production traffic for at least a full day
    2.5) Gather and analyze the results
    2.6) Repeat steps 2.2 to 2.5 for:
    • []interface{} queue
    • My chunked-list implementation
    • A two-array deque
    • A naive linked-list using container/list
  3. Aggregate and analyze all results
  4. Write a document explaining all results detailing each use case, special uses, etc

I don't have access to any production system anymore, but even if I had, imagine how hard it would be for me to convince the system owners to let me play around with their system, which would include deploying test code to production. Worse, we know, based on the synthetic tests, that some of those solutions will very likely not perform well, hurting the host system's performance.

Besides the nearly impossible task of finding real world production systems willing to help, the time it would take me to run all the tests and to gather and analyze the results is impractical for me. I can't afford to spend an entire month doing this. I have a bunch of other ideas besides the queue which I believe will help make Go a better language. I want to move on and focus on the next proposals and improvements, not linger on this one for God knows how much longer.

I urge you guys to make your decision based on:

  1. The queue design and the theory behind it
  2. The queue implementation
  3. The (synthetic) benchmark tests

There's so much more to this queue than just the code. I see a lot of challenges to the code, but none about the queue's foundation. The theory and design should be the ones most scrutinized, far more than the implementation. But I still haven't seen a single challenge to the queue's foundation.

If you do have any specific concerns regarding the theory, the design or the implementation that you feel would make the queue not perform competitively, please point out what those concerns are. I need real examples to help validate the threat of those concerns. Please point me to a specific line of code in the implementation, for instance, and state why you feel that line could pose problems, and in which scenario.

Another workable solution would be for you to point out exactly which synthetic test cases you would like to see. I can very quickly code those up and run the tests.

@egonelbre

Contributor

commented Oct 3, 2018

Took a look at the nil thing you mentioned; it's being cleared: https://github.com/egonelbre/exp/blob/059fd158c18a79639ee31a5f50b1b8c07df72e52/queue/blockqueue/queue.go#L68.

It should be sufficient to find a number of programs that need such a high-performance queue (there are a lot of examples), demonstrate that a queue is indeed the right thing they need, and show that using this queue improves their performance significantly (above 5%). Using production systems would of course be ideal, but lacking those, local benchmarks should suffice. I've recorded all the operations and then just played them back on different data structures to see the performance difference.

With regards to the implementation, my main concerns are: 1. allocation amount, 2. GC pressure, 3. use of interfaces (fixing this would require generics or unsafe), 4. concurrency (I cannot remember the last time I needed a basic non-concurrent unbounded queue).

With regards to which direction to take this: I would recommend making a proper package out of it. Get feedback on the implementation from the golang-nuts forum and the performance channel in Slack; you won't get a lot of feedback here. I also recommend writing a blog post about the analysis, because it contains good examples of how to analyze an implementation, and you will get more comments on how it can be improved.

This means that:

  1. there's a public queue package that people can start using today,
  2. you can get real world feedback,
  3. when there eventually is a decision on whether to include this, something else, or no data-structure package at all, this implementation will already have had real-world usage, and
  4. the implementation doesn't get stuck in this issue.

Either way, I'm withdrawing from the discussion, as it needs to wait for a decision on whether an unbounded non-concurrent queue makes sense to be included in the Go repositories at all.

@Merovius


commented Oct 3, 2018

There's so much more to this queue than just the code. I see a lot of challenges to the code, but none about the queue foundation. The theory and design should be the ones mostly scrutinized, way more than the implementation. But I still didn't see a single challenge to the queue foundation.

I don't understand how you came to that conclusion. ISTM that the arguments brought against this proposal are a) unbounded queues are a bad idea in general, b) that adding another generic container while generics in Go are actively discussed is not a good idea and c) that there is little benefit to a generic queue, if you can get sufficient (or better) performance using a naive implementation. These seem pretty foundational to me - in particular, I don't see a single comment that is related to the actual code. You might not agree with these arguments, but they are there.

[edit] oh, there's of course also d) that a generic queue is just as good outside the stdlib as inside it. Which also has nothing to do with the code.

@cznic

Contributor

commented Oct 3, 2018

On the performance of queue push

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"testing"

	"github.com/christianrpetrin/queue-tests/queueimpl7"
)

var (
	q7 *queueimpl7.Queueimpl7
	q  []int
)

func testQ(n int) {
	q = nil
	for i := 0; i < n; i++ {
		q = append(q, i) // Push
	}
}

func testQ7(n int) {
	q7 = queueimpl7.New()
	for i := 0; i < n; i++ {
		q7.Push(i)
	}
}

func test(f func(int), tag string) {
	var stats0, stats runtime.MemStats
	q = nil
	q7 = nil
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats0)
	result := testing.Benchmark(func(b *testing.B) { f(b.N) })
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats)
	fmt.Printf("%10s %v memory used %10v\n", tag, result, stats.Alloc-stats0.Alloc)
}

func main() {
	test(testQ, "[]int")
	test(testQ7, "Queueimpl7")
}

Playground (Does not run on playground)

16GB machine, i5-4670 CPU @ 3.40GHz × 4:

==== jnml@4670:~/src/github.com/cznic/issue27935> go version
go version go1.11.1 linux/amd64
==== jnml@4670:~/src/github.com/cznic/issue27935> ./issue27935 
     []int 100000000	        12.4 ns/op memory used  987122112
Queueimpl7 50000000	        31.3 ns/op memory used 1212501008
==== jnml@4670:~/src/github.com/cznic/issue27935> 

Wrt the 'memory used' value: the Queueimpl7 has in the end only half as many items as the []int slice.

edit: Added memory note

@christianrpetrin

Author

commented Oct 4, 2018

On the performance of queue push […]

Nice test @cznic!

  1. You are falling into this trap. This is not an apples-to-apples comparison.
  2. Unfortunately, a queue where data only gets pushed and none gets removed should not have many real world applications. Full lifecycle tests should be more representative of real world use cases.
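A full-lifecycle variant of the harness above might interleave pops with pushes, for example (a sketch under the same testing.Benchmark setup, not a claim about how such a benchmark must be designed):

```go
package main

import (
	"fmt"
	"testing"
)

// testQLifecycle sketches a "full lifecycle" variant of the push-only
// harness above: every push is followed by a pop, so the queue is
// exercised at a steady size instead of only growing. It returns the
// final queue length (always 0 here).
func testQLifecycle(n int) int {
	var q []int
	for i := 0; i < n; i++ {
		q = append(q, i)   // push
		_, q = q[0], q[1:] // pop
	}
	return len(q)
}

func main() {
	result := testing.Benchmark(func(b *testing.B) { testQLifecycle(b.N) })
	fmt.Println("[]int full lifecycle:", result)
}
```

Other mixes (push two, pop one; bursty fills and drains) would exercise yet other points of the lifecycle; the steady-state case is just the simplest alternative to push-only.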
@christianrpetrin

Author

commented Oct 4, 2018

@cznic @Merovius @egonelbre @bcmills You guys are tough! I love tough guys! Your challenges and contributions so far have been amazing! :-)

It seems to me that, in the same way you can't convince me that a pure slice based queue implementation is favorable over the proposed queue design (see "Design Considerations"), I won't be able to convince you that impl7 deserves to be part of the stdlib. In any case, your opinions are highly respected and appreciated!

Now, out of all the challenges to the proposal and the queue, below are what I believe are the most relevant points made against the queue proposal and the queue itself.

  1. There are some clear edge cases (here, here) where the queue actually performs worse than pure slice based implementations for small and mid-sized data sets

  2. The queue could perform as well outside the stdlib, as an independent package, as it would inside it (here, here)

  3. It's easy to build a performant queue using a simple slice. There's no need for a specialized implementation (here)

  4. There are so many different queue designs and implementations that we should not include a queue in the stdlib (here)

  5. Queue should not be included in the stdlib until generics gets sorted out (here, here, here, here, here)

  6. The queue needs real world data in order to help validate if, and how much would it improve the performance of real world applications (here and here)

  7. Extending, growing the stdlib means more maintenance work (of my own)

  8. There's no real use case for an unbounded queue that is not safe for concurrent use (here)

Below are my thoughts on each point.

  1. Even with edge cases like this, impl7 still consumes ~1/3 of the memory used by the pure slice based implementation. For more real-world-like cases, impl7 is much faster than, or very competitive with, the alternatives in virtually all scenarios

  2. Completely agree, but please consider these counter arguments

  3. Refer to the same comment here, as well as to how I was able to help @egonelbre fix his queue implementation here

  4. The proposal includes benchmarking the proposed solution against 6 other experimental
    queue designs and implementations, as well as the standard list package and a few external
    queue implementations such as phf, gammazero, juju. Other suggested queue implementations were tested as well, such as @bcmills's proposed queue implementation
    and cookiejar. The benchmark tests are still very favorable for impl7 when compared to all tested queues.

  5. Completely agree with it, but please refer to this and this regarding my concerns

  6. I don't think this is a practical suggestion. Please refer here for details

  7. Consider that I plan to keep not only using, but also helping to contribute to Go. I plan to propose many other improvements to Go, and I would very much like to get involved in helping to maintain Go as well, by helping with small bug fixes, validating other proposals, etc. By helping me help you, you'll gain yet another member who is willing to help you not only extend, but also maintain Go. So in a way, you are not getting more work; it's less, because I'll also be helping. So please consider the health of the community as well. If you are too conservative about accepting new contributions, current and future contributors will probably think twice before proposing any.

  8. CloudLogger has a very clear case for that. Also quoting Wikipedia: "Some common ADTs, which have proved useful in a great variety of applications, are". Queue features prominently on that list. Even if we can't see many real world use cases for a queue ourselves, consider that when Thomas Edison commercialized the light bulb, or when Henry Ford popularized the car, initially a lot of people thought: we don't need light bulbs or cars; we're completely fine with our candles and horses! To me the new design represents a significant step forward in queue designs, one that will last for the foreseeable future, because the design makes no assumption about how much data the struct will hold. The design dictates that the implementation will perform well no matter what,
    unlike pure slice based implementations that only perform well with small to mid-sized data sets (refer to all pure slice based implementations such as "BenchmarkImpl1" and "BenchmarkImpl2"). "Build useful things and people will find all sorts of uses for it."

To make it clear, I truly think the slice is a fantastic data structure. It's extremely flexible and so brilliantly implemented in Go that I had a really hard time coming up with a design and implementation able to do better for a queue, even though slices, as-is,
are not really the best data structure for a queue. So don't fail to see that impl7 IS a slice based implementation. What all the benchmark tests are testing, for the most part, is how the Go slice implementation works for small (128-item) slices. There's really no special, brand new directive or data structure in impl7.

Thanks very much @egonelbre for helping validate the design of the queue. Your implementation of the proposed design looks very good, and I'm very glad I was able to help you get it right. This kind of contribution, where the actual theory and design is validated through independent implementations, is most welcome! Also thanks very much for the insight into using head vs len. I look forward to working with you on my (and your!) next proposals!

Also a big thanks to @Merovius. Fantastic edge case you pointed out. I think this is a great case study for the design of generics. Also a big thanks to @cznic to have actively engaged in the discussion.

In my experience, often the biggest critics are also the biggest contributors. You guys did a fantastic job validating the queue. Thanks very much for your contribution to the queue and to the discussion.

As for the people who expressed support for the queue and the proposal, I can't be more appreciative of your time and support. Your support is highly appreciated!

Also a big thanks for the thumbs up, folks! :-)

I really can't afford to spend much more time on this discussion. I'll certainly keep an eye on the issue, but I'd very much like to get CloudLogger out, and due to this proposal, I'm already 2 weeks (and counting) late now. I also have a few other ideas that I think have enormous potential. I'll keep investigating them in my spare time and once I'm able to reach a final design and am able to prove the design works, I'll submit proposals. Stay tuned!

In any case, I look forward to hearing the final decision on the matter. If the proposal is accepted, of course I'll invest whatever time is needed to get it fully tested and into stdlib.

I'll keep an eye on the issue, so please keep commenting and expressing your support/concerns. I just don't plan to keep looking at the issue every 5 minutes moving forward. :-)

@egonelbre

Contributor

commented Oct 4, 2018

  1. I didn't say there aren't use cases, just that I haven't seen one lately.
  2. My code didn't have a bug, it was already clearing things properly, I merely pointed out the line where it was being done. Your conclusion about memory leaking was wrong.
@Merovius


commented Oct 5, 2018

CloudLogger has a very clear case for that.

Just to pick this one out (it applies to all the other things too): We simply disagree on this. I think your use case is a traditional example of why unbounded queues are not a good idea. They seem to offer convenience and simplicity on the surface, but in practice lead to cascading failures. Fundamentally, an unbounded queue is only useful if the consumer is slower than the producer - at least at times. But given that an unbounded queue - in particular an in-memory one - is never actually unbounded, if you don't actually handle that situation (e.g. by backpressure or dropping messages), the only thing it gets you is that you invariably end up in a situation where a failure of the consumer leads to a failure of the queue service, which leads to a failure of the producer. I don't think this math can really be changed - either queues can be bounded, or they contribute to cascading failure scenarios.
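The bounded alternative described above is easy to sketch with a buffered channel: when the queue is full, the producer must explicitly choose between blocking (backpressure) and dropping. This is an illustrative sketch only; the tryEnqueue helper is invented for the example and is not from any package discussed in this thread.

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send on a bounded queue.
// If the queue is full it reports failure instead of blocking,
// making the overload policy explicit rather than relying on
// unbounded growth.
func tryEnqueue(q chan<- string, msg string) bool {
	select {
	case q <- msg:
		return true
	default:
		return false // queue full: drop (or count, log, sample, ...)
	}
}

func main() {
	q := make(chan string, 2) // bound chosen up front

	// With no consumer draining q, only the first two sends succeed.
	sent, dropped := 0, 0
	for i := 0; i < 5; i++ {
		if tryEnqueue(q, fmt.Sprintf("log line %d", i)) {
			sent++
		} else {
			dropped++
		}
	}
	fmt.Println(sent, dropped) // prints: 2 3
}
```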

I didn't actually argue about this so far, because a) I don't think there can be an argument, because ultimately neither the assertion that unbounded queues are useful nor the assertion that they aren't can be verified or falsified and b) because I don't think it really matters - even assuming they are useful, it's IMO still not a good idea to have a generic version in the stdlib (for all the reasons mentioned).

I'm just trying to clarify that I understand that you feel you have good arguments on your side. Which is fair. But that doesn't mean we can't still disagree. Just as you feel you have good counters to all my arguments, I feel to have good counters to all your arguments :)

I really can't afford to spend much more time on this discussion. I'll certainly keep an eye on the issue, but very much like to get CloudLogger out and due to this proposal, I'm already 2 weeks (and counting) late now. I also have a few other ideas that I think have enormous potential. I'll keep investigating them in my spare time and once I'm able to reach a final design and am able to prove the design works, I'll submit proposals. Stay tuned!

Looking forward to it. FWIW, I believe you put a bit more work into this proposal than necessary. It's admirable that you took the effort to explain your arguments in detail and benchmark many different implementations against each other. But I, at least, would have found it just as useful to just present your proposed implementation as an example of a generic implementation that has different asymptotic behavior than a naive slice queue. The actual, detailed implementation questions can be sorted out after general agreement is reached that we want this package in the stdlib.

Not saying that to discourage you, but maybe to help you tune this tradeoff in your future proposals. Making the proposal simpler in the beginning and fleshing it out after the general ideas are accepted can only save you time :)

@cznic

Contributor

commented Oct 5, 2018

// Copyright 2018 The Issue27935 Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"testing"

	"github.com/christianrpetrin/queue-tests/queueimpl7"
)

var (
	q7   *queueimpl7.Queueimpl7
	q    []int
	tmpi interface{}
	tmp  int
	tmp2 bool
)

func testQ(b *testing.B, items *int, stats *runtime.MemStats) {
	q = nil
	for n := 0; n < b.N; n++ {
		q = append(q, n) // Push
		if n%3 == 0 {
			tmp = q[0]
			q = q[1:]
		}
	}
	b.StopTimer()
	*items = len(q)
	debug.FreeOSMemory()
	runtime.ReadMemStats(stats)
	b.StartTimer()
	for len(q) != 0 {
		tmp = q[0]
		q = q[1:]
	}
}

func testQ7(b *testing.B, items *int, stats *runtime.MemStats) {
	q7 = queueimpl7.New()
	for n := 0; n < b.N; n++ {
		q7.Push(n)
		if n%3 == 0 {
			tmpi, tmp2 = q7.Pop()
		}
	}
	b.StopTimer()
	*items = q7.Len()
	debug.FreeOSMemory()
	runtime.ReadMemStats(stats)
	b.StartTimer()
	for q7.Len() != 0 {
		tmpi, tmp2 = q7.Pop()
	}
}

func test(f func(*testing.B, *int, *runtime.MemStats), tag string) {
	var stats0, stats runtime.MemStats
	var items int
	q = nil
	q7 = nil
	debug.FreeOSMemory()
	runtime.ReadMemStats(&stats0)
	result := testing.Benchmark(func(b *testing.B) { f(b, &items, &stats) })
	memUsed := stats.Alloc - stats0.Alloc
	fmt.Printf("%10s: %9v ops, %4.1f ns/op, memory used %9v, items %7v, %5.2f B/item\n",
		tag, result.N, float64(result.T)/float64(result.N), memUsed, items, float64(memUsed)/float64(items),
	)
}

func main() {
	test(testQ, "[]int")
	test(testQ7, "Queueimpl7")
}
==== jnml@4670:~/src/github.com/cznic/issue27935> ./issue27935 
     []int: 100000000 ops, 15.8 ns/op, memory used 595890688, items 66666666,  8.94 B/item
Queueimpl7:  50000000 ops, 38.0 ns/op, memory used 808335248, items 33333333, 24.25 B/item
==== jnml@4670:~/src/github.com/cznic/issue27935> 
@christianrpetrin

Author

commented Oct 9, 2018

This is the edge case identified and talked about (here and here). We hope generics will fix this. For a more real-world-like scenario, please refer to the performance of impl1, as impl1 uses "interface{}" as its data type instead of int (much more flexible).

@Merovius


commented Oct 9, 2018

For a more real world like scenario

I disagree with the characterization of interface{} as "more real world like". I don't think there is anything more realistic about pointer-shaped values over non-pointer-shaped ones - on the contrary, actually. I can't think of a good reason why things in a queue should be pointers. ISTM they will, in general, represent plain data.

much more flexible

The point is specifically (and it would be helpful if you'd acknowledge that instead of dismissing it as an "edge case") that this flexibility incurs a cost, in the form of more allocations and indirections. The generic queue implementation is slower. And FTR, I predict that generics won't change that either (though we'll have to see that once they are implemented).
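The cost referred to above is easy to observe: storing ints through interface{} boxes each value, adding allocations that a plain []int never incurs. A small illustrative sketch using the standard library's testing.AllocsPerRun (the exact numbers vary by Go version; the comparison is the point):

```go
package main

import (
	"fmt"
	"testing"
)

func main() {
	const n = 1000

	// Plain []int: values are stored inline in the backing array.
	ints := testing.AllocsPerRun(100, func() {
		s := make([]int, 0, n)
		for i := 0; i < n; i++ {
			s = append(s, i)
		}
	})

	// []interface{}: each int is boxed into a separately allocated
	// interface value before being stored, costing extra allocations
	// and an indirection on every read.
	ifaces := testing.AllocsPerRun(100, func() {
		s := make([]interface{}, 0, n)
		for i := 0; i < n; i++ {
			s = append(s, i)
		}
	})

	fmt.Printf("[]int: %.0f allocs/run, []interface{}: %.0f allocs/run\n", ints, ifaces)
	fmt.Println("interface{} allocates more:", ifaces > ints) // prints: ... true
}
```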

@rsc

Contributor

commented Oct 10, 2018

At the least, this proposal should be marked as "on hold for generics". But really I think this should be done outside the standard library and prove its necessity first. We have not been creating new packages from scratch in the standard library for quite some time.

@rsc rsc changed the title Proposal: Built in support for high performance unbounded queue proposal: add container/queue Oct 10, 2018

@rsc rsc added the Proposal-Hold label Oct 10, 2018

@christianrpetrin

Author

commented Oct 11, 2018

@rsc I agree we should wait for the generics discussion to take place. In the meantime, I'll try to validate the proposed solution as much as possible by pinging golang-nuts, making a proper external package, etc. The biggest problem with exposing the queue as an external package only is how to raise awareness of it. Marketing is a gigantic problem that I can't solve by myself. It will probably take years before the package really takes off and people start using it and reporting their experiences.

In any case, I'll keep an eye and try to join the generics discussion as well.

Thanks for taking a look at the proposal!

@christianrpetrin

Author

commented Nov 26, 2018

Given the many suggestions, I have built a new queue and deployed it as a proper package. The new queue is based on impl7 and is actually a deque now. I appreciate the suggestions (@phf, @bcmills), as a deque is indeed a much more flexible data structure that can be used not only as a FIFO queue, but also as a LIFO stack and as a deque with mixed push/pop.

The new deque is a considerably more advanced data structure than impl7, both because it is now a deque and because it has a new design.

The new design leverages the linked slices design by making it an auto-shrinking ring. It improved impl7's performance and efficiency quite dramatically in most tests, although it is now a bit slower on the fill tests. Refer here for the design details.
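For illustration, the "auto-shrinking ring" idea can be sketched as a circular slice that doubles when full and halves when mostly empty. This is a toy sketch of the general technique only, not the actual deque package's implementation; the names and the grow/shrink thresholds are invented for the example.

```go
package main

import "fmt"

const minCap = 4 // smallest buffer kept; an invented threshold for this sketch

// Ring is a toy FIFO built on a circular slice that grows when full
// and shrinks when mostly empty, bounding wasted memory after a
// large burst has been drained.
type Ring struct {
	buf        []interface{}
	head, size int
}

func NewRing() *Ring { return &Ring{buf: make([]interface{}, minCap)} }

func (r *Ring) Len() int { return r.size }

// resize copies the live items into a fresh buffer of newCap,
// unwrapping the circular layout so head restarts at index 0.
func (r *Ring) resize(newCap int) {
	nb := make([]interface{}, newCap)
	for i := 0; i < r.size; i++ {
		nb[i] = r.buf[(r.head+i)%len(r.buf)]
	}
	r.buf, r.head = nb, 0
}

func (r *Ring) PushBack(v interface{}) {
	if r.size == len(r.buf) {
		r.resize(2 * len(r.buf)) // grow: double when full
	}
	r.buf[(r.head+r.size)%len(r.buf)] = v
	r.size++
}

func (r *Ring) PopFront() (interface{}, bool) {
	if r.size == 0 {
		return nil, false
	}
	v := r.buf[r.head]
	r.buf[r.head] = nil // release the reference for the GC
	r.head = (r.head + 1) % len(r.buf)
	r.size--
	// shrink: halve when only a quarter of the buffer is in use
	if len(r.buf) > minCap && r.size <= len(r.buf)/4 {
		r.resize(len(r.buf) / 2)
	}
	return v, true
}

func main() {
	r := NewRing()
	for i := 0; i < 100; i++ {
		r.PushBack(i) // buffer grows to hold the burst
	}
	for i := 0; i < 99; i++ {
		r.PopFront() // buffer shrinks back as it drains
	}
	v, _ := r.PopFront()
	fmt.Println(v, r.Len(), len(r.buf)) // prints: 99 0 4
}
```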

[benchmark chart: ns/op]

I have also implemented many new unit, integration, API and, especially, benchmark tests. @egonelbre, much appreciated for the test suggestions and the suggestion to benchmark cookiejar's queue.

Given the need for real-world usage data, I need to somehow get people to start using it. I'll do my best to raise awareness about the deque, but this is an area where I really need help. If you are using a queue/stack/deque, please consider switching to the new deque. Also, if you know someone who is using a queue/stack/deque, please let them know about it.

Below are a few selected tests of deque vs impl7.

deque vs impl7 - FIFO queue - fill tests

benchstat testdata/BenchmarkFillDequeQueue.txt testdata/BenchmarkFillImpl7Queue.txt
name        old time/op    new time/op    delta
/0-4          39.9ns ± 9%    35.9ns ± 3%   -9.88%  (p=0.000 n=10+9)
/1-4           142ns ± 1%     133ns ± 1%   -6.61%  (p=0.000 n=10+10)
/10-4          636ns ± 1%     764ns ± 7%  +20.26%  (p=0.000 n=9+9)
/100-4        4.74µs ± 3%    4.28µs ± 3%   -9.67%  (p=0.000 n=9+9)
/1000-4       43.0µs ±23%    38.8µs ± 7%   -9.84%  (p=0.004 n=10+10)
/10000-4       450µs ±19%     388µs ± 5%  -13.70%  (p=0.001 n=10+10)
/100000-4     4.24ms ± 4%    3.95ms ± 2%   -6.84%  (p=0.000 n=10+8)
/1000000-4    46.8ms ± 1%    45.9ms ± 4%   -1.87%  (p=0.008 n=8+9)

name        old alloc/op   new alloc/op   delta
/0-4           64.0B ± 0%     48.0B ± 0%  -25.00%  (p=0.000 n=10+10)
/1-4            144B ± 0%      112B ± 0%  -22.22%  (p=0.000 n=10+10)
/10-4           608B ± 0%      736B ± 0%  +21.05%  (p=0.000 n=10+10)
/100-4        6.19kB ± 0%    4.26kB ± 0%  -31.27%  (p=0.000 n=10+10)
/1000-4       33.0kB ± 0%    33.2kB ± 0%   +0.58%  (p=0.000 n=10+10)
/10000-4       322kB ± 0%     323kB ± 0%   +0.23%  (p=0.000 n=10+10)
/100000-4     3.22MB ± 0%    3.23MB ± 0%   +0.20%  (p=0.000 n=10+10)
/1000000-4    32.2MB ± 0%    32.3MB ± 0%   +0.19%  (p=0.000 n=10+9)

name        old allocs/op  new allocs/op  delta
/0-4            1.00 ± 0%      1.00 ± 0%     ~     (all equal)
/1-4            4.00 ± 0%      4.00 ± 0%     ~     (all equal)
/10-4           15.0 ± 0%      17.0 ± 0%  +13.33%  (p=0.000 n=10+10)
/100-4           107 ± 0%       109 ± 0%   +1.87%  (p=0.000 n=10+10)
/1000-4        1.01k ± 0%     1.02k ± 0%   +0.99%  (p=0.000 n=10+10)
/10000-4       10.1k ± 0%     10.2k ± 0%   +0.79%  (p=0.000 n=10+10)
/100000-4       101k ± 0%      102k ± 0%   +0.78%  (p=0.000 n=10+10)
/1000000-4     1.01M ± 0%     1.02M ± 0%   +0.78%  (p=0.000 n=10+10)

deque vs impl7 - FIFO queue - refill tests

benchstat testdata/BenchmarkRefillDequeQueue.txt testdata/BenchmarkRefillImpl7Queue.txt
name       old time/op    new time/op    delta
/1-4         3.79µs ± 1%   10.03µs ± 6%  +165.06%  (p=0.000 n=9+10)
/10-4        37.8µs ± 7%    74.7µs ± 5%   +97.58%  (p=0.000 n=10+10)
/100-4        361µs ± 7%     442µs ± 4%   +22.25%  (p=0.000 n=9+10)
/1000-4      3.75ms ± 4%    3.97ms ± 3%    +6.10%  (p=0.000 n=10+9)
/10000-4     36.5ms ± 3%    39.1ms ± 6%    +7.17%  (p=0.000 n=10+10)
/100000-4     380ms ± 1%     421ms ±24%   +10.67%  (p=0.000 n=10+9)

name       old alloc/op   new alloc/op   delta
/1-4         1.60kB ± 0%    6.40kB ± 0%  +300.00%  (p=0.000 n=10+10)
/10-4        16.0kB ± 0%    68.8kB ± 0%  +330.00%  (p=0.000 n=10+10)
/100-4        160kB ± 0%     421kB ± 0%  +163.00%  (p=0.000 n=8+10)
/1000-4      2.42MB ± 0%    3.32MB ± 0%   +37.30%  (p=0.000 n=10+10)
/10000-4     17.0MB ± 0%    32.3MB ± 0%   +89.36%  (p=0.000 n=8+10)
/100000-4     162MB ± 0%     323MB ± 0%   +99.52%  (p=0.000 n=8+8)

name       old allocs/op  new allocs/op  delta
/1-4            100 ± 0%       300 ± 0%  +200.00%  (p=0.000 n=10+10)
/10-4         1.00k ± 0%     1.60k ± 0%   +60.00%  (p=0.000 n=10+10)
/100-4        10.0k ± 0%     10.8k ± 0%    +8.00%  (p=0.000 n=10+10)
/1000-4        100k ± 0%      102k ± 0%    +1.80%  (p=0.000 n=10+10)
/10000-4      1.00M ± 0%     1.02M ± 0%    +1.57%  (p=0.000 n=8+10)
/100000-4     10.0M ± 0%     10.2M ± 0%    +1.56%  (p=0.000 n=10+10)

Although impl7 marginally outperforms deque on the fill tests, deque outperforms impl7
on all other tests, by large margins in most cases, especially when it comes to memory usage.
The refill (above) and stable (below) tests are just two examples.

deque vs impl7 - FIFO queue - stable tests

benchstat testdata/BenchmarkStableDequeQueue.txt testdata/BenchmarkStableImpl7Queue.txt
name        old time/op    new time/op    delta
/1-4          34.8ns ± 1%    38.1ns ± 5%    +9.48%  (p=0.000 n=10+9)
/10-4          353ns ± 2%     391ns ± 7%   +10.92%  (p=0.000 n=10+10)
/100-4        3.45µs ± 1%    4.13µs ±19%   +19.82%  (p=0.000 n=10+10)
/1000-4       34.5µs ± 1%    39.6µs ±12%   +14.95%  (p=0.000 n=10+10)
/10000-4       346µs ± 2%     385µs ± 4%   +11.40%  (p=0.000 n=10+10)
/100000-4     3.43ms ± 1%    3.85ms ±11%   +12.11%  (p=0.000 n=10+10)
/1000000-4    34.4ms ± 1%    38.3ms ± 6%   +11.13%  (p=0.000 n=10+10)

name        old alloc/op   new alloc/op   delta
/1-4           16.0B ± 0%     32.0B ± 0%  +100.00%  (p=0.000 n=10+10)
/10-4           160B ± 0%      322B ± 0%  +101.25%  (p=0.000 n=10+10)
/100-4        1.60kB ± 0%    3.23kB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000-4       16.0kB ± 0%    32.2kB ± 0%  +101.56%  (p=0.000 n=10+10)
/10000-4       160kB ± 0%     322kB ± 0%  +101.56%  (p=0.000 n=10+10)
/100000-4     1.60MB ± 0%    3.23MB ± 0%  +101.56%  (p=0.000 n=10+10)
/1000000-4    16.0MB ± 0%    32.3MB ± 0%  +101.56%  (p=0.000 n=10+10)

name        old allocs/op  new allocs/op  delta
/1-4            1.00 ± 0%      1.00 ± 0%      ~     (all equal)
/10-4           10.0 ± 0%      10.0 ± 0%      ~     (all equal)
/100-4           100 ± 0%       101 ± 0%    +1.00%  (p=0.000 n=10+10)
/1000-4        1.00k ± 0%     1.01k ± 0%    +1.50%  (p=0.000 n=10+10)
/10000-4       10.0k ± 0%     10.2k ± 0%    +1.56%  (p=0.000 n=10+10)
/100000-4       100k ± 0%      102k ± 0%    +1.56%  (p=0.000 n=10+10)
/1000000-4     1.00M ± 0%     1.02M ± 0%    +1.56%  (p=0.000 n=10+10)

Refer here for all benchmark tests.

The new package is production ready and I have released its first version. It can be found here.

Regarding the suggestions for more tests (here, here) and to gather real-world data, @bcmills, @egonelbre, do the new benchmark tests help address your concerns?

Regarding the questions about whether a FIFO queue is actually useful (here, here), @egonelbre, @rsc, does a deque help address your concerns? Please consider that a deque can be used not only as a FIFO queue, but also as a LIFO stack as well as with mixed push/pop.

Regarding the concerns about whether Go needs a specialized queue or whether simply using a slice as a queue is good enough (here), @Merovius, does the new queue's performance vs a slice help address your concerns? In the wake of all the new benchmark tests and the much improved deque design, do you still feel a naive deque implementation that uses a slice, such as CustomSliceQueue, is the best option?

@tv42


commented Jun 9, 2019

Marketing is a gigantic problem that I can't solve by myself.

I'm an outsider to this conversation, but please take this point: the Go stdlib (or golang.org/x) should not be a marketing mechanism for your code; especially not for its clever implementation. That line of thinking can only decrease the stdlib's technical quality.
