
Added 'resize' operation to BoundedQueue #1949

Merged
merged 2 commits from the Resize-bounded-queue branch into jaegertracing:master on Dec 19, 2019

Conversation

jpkrohling
Contributor

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>

Which problem is this PR solving?

Short description of the changes

  • This adds a new Resize(int) method to the BoundedQueue
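
For illustration, here is a usage sketch of the new method (not the PR's own example). It assumes the StartConsumers/Produce/Resize/Capacity signatures that appear in the test excerpts later in this thread; NewBoundedQueue and Stop are assumed from the existing package API.

```go
package main

import (
	"fmt"

	"github.com/jaegertracing/jaeger/pkg/queue"
)

func main() {
	// Queue with an initial capacity of 1000; the callback receives items
	// dropped when the queue is full.
	q := queue.NewBoundedQueue(1000, func(item interface{}) {
		fmt.Println("dropped:", item)
	})
	q.StartConsumers(4, func(item interface{}) {
		fmt.Println("consumed:", item)
	})

	q.Produce("span-1")

	// New in this PR: change the capacity at runtime. The tests quoted later
	// in this thread show Resize returning a bool.
	if q.Resize(2000) {
		fmt.Println("resized, new capacity:", q.Capacity())
	}

	q.Stop()
}
```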

@codecov

codecov bot commented Nov 28, 2019

Codecov Report

Merging #1949 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1949      +/-   ##
==========================================
- Coverage   96.99%   96.94%   -0.06%     
==========================================
  Files         203      203              
  Lines       10064    10082      +18     
==========================================
+ Hits         9762     9774      +12     
- Misses        264      269       +5     
- Partials       38       39       +1
| Impacted Files | Coverage Δ |
|---|---|
| pkg/queue/bounded_queue.go | 94.44% <100%> (-5.56%) ⬇️ |
| cmd/query/app/static_handler.go | 86.84% <0%> (-1.76%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46d7f17...bec2db4.

@jpkrohling jpkrohling changed the title Added 'resize' operation to BoundedQueue WIP - Added 'resize' operation to BoundedQueue Nov 28, 2019
@jpkrohling jpkrohling changed the title WIP - Added 'resize' operation to BoundedQueue Added 'resize' operation to BoundedQueue Nov 28, 2019
@yurishkuro
Member

how is it supposed to be used? who would call the Resize()? Isn't it possible to just pre-determine the queue size upfront?

@jpkrohling
Contributor Author

how is it supposed to be used? who would call the Resize()?

A change to the span processor will make it keep track of the number of spans and the total bytes it has seen during the process lifetime. Once enough spans have been recorded, the span processor starts calculating the ideal queue size based on the average span size and the available memory. It can then decide to trigger a Resize() operation.

I have a local branch that does the above, but I thought it would be cleaner to separate those two features into separate PRs.

Isn't it possible to just pre-determine the queue size upfront?

When we know the expected average span size, yes. But I haven't heard of any cases where people know the average span size in advance, which makes the queue size hard to calculate.
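
For concreteness, a minimal sketch of the sizing calculation described above; all names here (idealQueueSize, spansSeen, bytesSeen, availableMemoryBytes) are hypothetical, not the span processor's actual code.

```go
package main

import "fmt"

// idealQueueSize estimates how many items fit in the memory budget, given the
// running totals of spans and bytes observed so far.
func idealQueueSize(spansSeen, bytesSeen, availableMemoryBytes int64) int {
	if spansSeen == 0 {
		return 0 // not enough data yet to estimate an average span size
	}
	avgSpanSize := bytesSeen / spansSeen
	if avgSpanSize == 0 {
		return 0
	}
	return int(availableMemoryBytes / avgSpanSize)
}

func main() {
	// e.g. 1.5 KiB average spans and 1 GiB of memory dedicated to the queue
	fmt.Println(idealQueueSize(1000, 1000*1536, 1<<30)) // ≈ 699050 items
}
```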

@jpkrohling
Contributor Author

jpkrohling commented Nov 29, 2019

I just added a benchmark to the tests, to make sure that I'm not introducing performance problems to the queue. Here are the results comparing master with this PR, which show no performance degradation for current users of the queue (unless I'm doing something really wrong in the benchmark):

I'm updating the benchmark results: I was getting data races when reading the len/cap of the channel, so I changed to a RWMutex. I ran the benchmark a few times, as I'm getting a big standard deviation.

Master:

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	56784301	       136 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	7.813s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	57453411	        90.4 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	5.303s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	80754200	       275 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	26.512s

This PR, which seems more consistent:

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	18307092	       330 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	6.388s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	17270967	       337 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	6.175s

$ go test -benchmem -run=^$ ./pkg/queue -bench "^(BenchmarkBoundedQueue)$" -test.benchtime 5s
goos: linux
goarch: amd64
pkg: github.com/jaegertracing/jaeger/pkg/queue
BenchmarkBoundedQueue-8   	17288391	       316 ns/op	       8 B/op	       1 allocs/op
PASS
ok  	github.com/jaegertracing/jaeger/pkg/queue	5.828s
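
For reference, a minimal sketch of what a benchmark along these lines might look like (not necessarily the benchmark added in this PR), assuming the package-internal NewBoundedQueue/StartConsumers/Produce/Stop API used elsewhere in this thread:

```go
package queue

import "testing"

// BenchmarkBoundedQueueSketch is illustrative only: it measures Produce with a
// no-op consumer pool draining the queue.
func BenchmarkBoundedQueueSketch(b *testing.B) {
	q := NewBoundedQueue(1000, func(item interface{}) {})
	defer q.Stop()

	q.StartConsumers(10, func(item interface{}) {})

	b.ReportAllocs()
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		q.Produce(n) // boxing the int into interface{} matches the 8 B/op, 1 alloc/op seen above
	}
}
```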

@yurishkuro
Member

I am skeptical about relying on the average span size, because the actual size differences can be very large, 2-3 orders of magnitude, e.g. a basic span vs. a span with a stack trace vs. a span with the request payload added to the logs. I think if we want to go by memory size, we should go by the actual size of the spans in the queue. The proto/domain model already implements a Size() method, IIRC, which could be used as an approximation of the actual memory size, and we would need to re-implement the queue using an explicit buffer rather than a bounded channel (might be worth looking around for an existing implementation).
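
A rough sketch of that alternative (illustrative only; the type and field names are made up), bounding the queue by total bytes via the items' Size() method:

```go
package main

import (
	"fmt"
	"sync"
)

// sizer matches the Size() method that the proto-generated span types expose.
type sizer interface {
	Size() int
}

// byteBoundedQueue bounds the queue by total bytes instead of item count.
type byteBoundedQueue struct {
	mu       sync.Mutex
	items    []sizer
	bytes    int
	maxBytes int
}

func (q *byteBoundedQueue) produce(item sizer) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.bytes+item.Size() > q.maxBytes {
		return false // would exceed the memory budget, drop the item
	}
	q.items = append(q.items, item)
	q.bytes += item.Size()
	return true
}

// fakeSpan is a stand-in for a real span with a known encoded size.
type fakeSpan struct{ size int }

func (s fakeSpan) Size() int { return s.size }

func main() {
	q := &byteBoundedQueue{maxBytes: 4096}
	fmt.Println(q.produce(fakeSpan{size: 1500})) // true
	fmt.Println(q.produce(fakeSpan{size: 3000})) // false, over the 4 KiB budget
}
```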

@jpkrohling
Contributor Author

the actual size differences can be very large, 2-3 orders of magnitude, e.g. a basic span vs. a span with a stack trace vs. a span with the request payload added to the logs

You are right, but in practice this wouldn't be a big problem. The queue sizes will typically be big enough to make outliers irrelevant. In my working branch for #943, spans generated by tracegen have an average size of 1.5 KiB, which means a queue size of about 700k items at 1 GiB. And even if spans suddenly start getting bigger, the periodic background process will readjust the queue size as long as the difference is bigger than a few percentage points. Note that the "memory" aspect is an approximation: there shouldn't be a crash if we end up using more memory for a short period of time. And most of the time, the queue should be empty anyway ;-)

I would give this simple approach a try first and see how it works in the real world.

@jpkrohling
Contributor Author

@yurishkuro any feedback on this one?

q.resizeLock.Lock()
defer q.resizeLock.Unlock()

new := make(chan interface{}, newCap)
Member

  • what if newCap is less than the current capacity and the current size? This will probably deadlock
  • please don't use the builtin new as a var name

Contributor Author

Funny, I thought I had added a test for the downscale, but it looks like I haven't. Will add one.

@yurishkuro
Member

I wonder if it's possible to avoid the mutex by approaching this differently. Goroutines are cheap to start, so rather than copying the messages into another queue, we could start another set of workers reading from a new channel, close the original channel, and let the old workers drain it naturally and exit.
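
A rough sketch of that approach (illustrative only, not what the PR ended up with verbatim): the win is that no queued items are copied and the old workers exit on their own. Note that in real code the read/write of the channel field must still be synchronized, which is exactly what the later review comments get into.

```go
package main

import (
	"fmt"
	"time"
)

// resizableQueue is a toy stand-in for BoundedQueue, just to show the idea.
type resizableQueue struct {
	items chan interface{}
}

// resize swaps in a channel with the new capacity, starts fresh consumers on
// it, closes the old channel, and lets the old consumers drain it and exit.
// NOTE: access to q.items needs proper synchronization in real code.
func (q *resizableQueue) resize(newCap, consumers int, callback func(item interface{})) {
	oldCh := q.items
	newCh := make(chan interface{}, newCap)
	q.items = newCh // producers use the new channel from now on

	for i := 0; i < consumers; i++ {
		go func() {
			for item := range newCh {
				callback(item)
			}
		}()
	}

	close(oldCh) // old workers' range loops finish once the old channel drains
}

func main() {
	q := &resizableQueue{items: make(chan interface{}, 2)}
	oldCh := q.items
	go func() {
		for item := range oldCh {
			fmt.Println("old consumer:", item)
		}
	}()

	q.items <- "a"
	q.resize(4, 1, func(item interface{}) { fmt.Println("new consumer:", item) })
	q.items <- "b"

	time.Sleep(100 * time.Millisecond) // crude wait so both consumers get to run
}
```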

@jpkrohling
Contributor Author

That could work, let me give it a try.

@jpkrohling
Contributor Author

PR updated to use a new channel for the resized queue.

@jpkrohling
Contributor Author

@yurishkuro would you please re-review?

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
// we might have two concurrent backing queues at the moment
// their combined size is stored in q.size, and their combined capacity
// should match the capacity of the new queue
if atomic.LoadInt32(&q.size) >= int32(q.Capacity()) && q.Capacity() > 0 {
Member

s/atomic.LoadInt32(&q.size)/int32(q.Size()) to keep consistent with the capacity

Member

I am not sure I understand the && q.Capacity() > 0 part. Isn't the capacity always bigger than 0? Does it matter in this if?

Contributor Author

There are consumers of this queue that are not setting the queue size. IIRC, one of them is in the span processor tests. Agree with your earlier suggestion though.

Member

I see there is a test which sets the size to 0


released, resized := false, false
q.StartConsumers(1, func(item interface{}) {
if !resized { // we'll have a second consumer once the queue is resized
Member

I didn't find the second consumer

Contributor Author

Around line 200:

resized = true
assert.True(t, q.Resize(4))
assert.True(t, q.Produce("e")) // in process by the second consumer
secondConsumer.Wait()

released := false
q.StartConsumers(1, func(item interface{}) {
// once we release the lock, we might end up with multiple calls to reach this
if !released {
Member

Released does not seem to be needed

Contributor Author

Why's that? Note that there are quite a few goroutines just waiting for this lock to be released so they can start calling consumers with other items. If we have not released, we mark one step down in the wait group. If we have already released, we do not want to count down further, as we'd reach a negative count.

Member

I mean you can remove this variable altogether. It's not changed during the test, only at the end, which does not affect the test, and the default value satisfies this if statement.

Member

Removing this variable does not have any effect on the tests. It seems redundant.

Contributor Author

It needs to be there, otherwise the test will intermittently fail due to "negative counts" in the wait group.
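
For context, the failure mode being avoided here is Go's WaitGroup panicking when Done is called more often than Add. A minimal, hypothetical demonstration (not the test's actual code):

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done()
	wg.Done() // panics: "sync: negative WaitGroup counter" — hence the guard in the test
}
```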


assert.EqualValues(t, 2, q.Capacity())
assert.EqualValues(t, 4, q.Size()) // the queue will eventually drain, but it will live for a while over capacity
assert.EqualValues(t, 0, len(*q.items)) // the new queue is empty, as the old queue is still full and over capacity
Member

Shouldn't we keep accepting here? There are free spots in the queue.

Contributor Author

There are free spots in the individual channel, but if the queue size was set by the user to 3000, it's wrong to have 6000 items in the two queues combined.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
@pavolloffay pavolloffay left a comment
Member

LGTM. Let's see how this will be applied in the collector, but that is part of a separate issue.

@jpkrohling
Contributor Author

Thanks!

@jpkrohling jpkrohling merged commit 3fa8a08 into jaegertracing:master Dec 19, 2019
 	size          int32
 	onDroppedItem func(item interface{})
-	items         chan interface{}
+	items         *chan interface{}
Member

This is being updated with a CAS in one place but accessed directly in other places, which is technically not thread safe. I would rather we used a normal mutex or atomic.Value.
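
For illustration, a sketch of the atomic.Value variant mentioned above (the type and method names are made up, this is not the PR's code): every access goes through the same atomic.Value, so the swap in Resize and the reads in Produce are consistently synchronized.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type boundedQueueSketch struct {
	items atomic.Value // always holds a chan interface{}
}

func newBoundedQueueSketch(capacity int) *boundedQueueSketch {
	q := &boundedQueueSketch{}
	q.items.Store(make(chan interface{}, capacity))
	return q
}

func (q *boundedQueueSketch) channel() chan interface{} {
	return q.items.Load().(chan interface{})
}

func (q *boundedQueueSketch) Produce(item interface{}) bool {
	select {
	case q.channel() <- item:
		return true
	default:
		return false // queue full, item dropped
	}
}

func (q *boundedQueueSketch) Resize(newCap int) {
	old := q.channel()
	q.items.Store(make(chan interface{}, newCap))
	close(old) // existing consumers drain the old channel and exit
}

func main() {
	q := newBoundedQueueSketch(1)
	fmt.Println(q.Produce("a")) // true
	fmt.Println(q.Produce("b")) // false, queue full
	q.Resize(2)
	fmt.Println(q.Produce("c")) // true, the new channel has room
}
```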

@jpkrohling jpkrohling deleted the Resize-bounded-queue branch July 28, 2021 19:21