Fix undefined switch logic when reporter is closed #239
Conversation
Signed-off-by: Won Jun Jang <wjang@uber.com>
Codecov Report
@@ Coverage Diff @@
##           master    #239    +/- ##
=====================================
+ Coverage   83.24%   83.3%   +0.06%
=====================================
  Files          51      51
  Lines        2691    2696       +5
=====================================
+ Hits         2240    2246       +6
  Misses        326     326
+ Partials      125     124       -1
Continue to review full report at Codecov.
@alexeykudinkin can you take a look?
reporter.go
Outdated
@@ -233,6 +235,12 @@ func (r *remoteReporter) Close() {
// reporting new spans.
func (r *remoteReporter) processQueue() {
	timer := time.NewTicker(r.bufferFlushInterval)
	close := func() {
Not a huge fan of assigning to the built-in function name close. Also, can't you just do defer func() {...}() in this case?
+1, defer func() {...}() should do it
// After that, drain the queue
// Cut off report requests still in-flight and drain the queue
// NB we keep r.queue open because closing it causes a race condition in Report(): with both r.closed and r.queue closed, the select could pick either case and panic
why not close r.queue when exiting processQueue()?
then we run into the same race condition in Report()
@black-adder I think we can do it much simpler and just order the checks, splitting the select in two.
I still think we could potentially run into a race condition (I may be completely wrong). I'm inlining the original Close() function and the newly suggested Report() function to show a potential race condition (although highly unlikely):
@black-adder you're right, this very well might happen. If we absolutely need to close the r.queue, we need to have the same reportersDrained wait group, which would allow us to drain all of the reports in-flight before closing the r.queue.
reporter.go
Outdated
@@ -247,12 +255,12 @@ func (r *remoteReporter) processQueue() {
r.metrics.ReporterQueueLength.Update(atomic.LoadInt64(&r.queueLength))
}
} else {
This is dead code, given that we're not closing r.queue anymore.
Signed-off-by: Won Jun Jang <wjang@uber.com>
I don't really see a need to close the queue channel, apart from it being a bad "citizen".
reporter.go
Outdated
@@ -246,13 +248,12 @@ func (r *remoteReporter) processQueue() {
// to reduce the number of gauge stats, we only emit queue length on flush
r.metrics.ReporterQueueLength.Update(atomic.LoadInt64(&r.queueLength))
}
} else {
don't need to read ok in this case, can remove the if ok
fwiw, the Ticker we're using for periodic flushes also doesn't close the channel when you close the ticker.
I somehow introduced some flakiness, will address.
Signed-off-by: Won Jun Jang <wjang@uber.com>
Hmm, I figured out the cause of the flakiness. The flakiness fix has made me reconsider the changes this PR made. Essentially, what's happening was:

From a user's perspective, this should save span1, and it's understandable that span2 isn't saved. However, with the changes in this PR, span1 isn't guaranteed to be persisted. I might have to add a mutex around the whole class, which is unfortunate...
Can you explain why span1 may not be saved? The close() call has a waitgroup that ensures that the queue is drained.
@yurishkuro b/c of what I was talking about:
@alexeykudinkin IDK what you mean.
@isaachier
My mistake. Do you think adding a wait group to Report would alleviate the problem?
I'm giving up on this solution and gonna just add a mutex: https://github.com/jaegertracing/jaeger-client-go/pull/240/files
@black-adder putting a mutex will make every Report() call synchronize on it.
I was just looking at the lightstep implementation: it's not using a channel for the span queue, but an explicit buffer protected by a mutex (a mutex on the tracer, which allows things like checking if it's closed). The side effect is that they don't have eager flushing when the packet size is reached; instead they run a flush loop at 500ms frequency.
@yurishkuro did the same for C++ client but I feel it is very "un-Go" to avoid the channel altogether. |
Channels are not some free magic, they are typically more expensive than similar explicit code with a mutex. Their benefit is their use in … Anyway, I am not suggesting we change the implementation completely. We could simplify the logic by using just a single queue with a command pattern. That is, instead of …

This way we don't need to close the queue from the Close() method, we just send a command with …
@black-adder @yurishkuro addressed in #241 |
There's a race condition in remoteReporter.Report() where, if both the r.closed and r.queue channels are closed, the select statement will pick one arbitrarily, causing a panic if Report is called post-close. This PR keeps r.queue open so that even if the queue is somehow filled after close() is called, it will never be flushed. An atomic flag might've worked here, but I didn't want to add extra synchronization given the channels already do that for us.