Investigate how to limit backlog on Elasticsearch collector #1760

codefromthecrypt · 2017-10-07T02:09:19Z

Our elasticsearch span consumer uses okhttp under the scenes, which is subject to a default of 64 in-flight connections. We should think about a cancelation policy/algo so that we don't OOM when the elasticsearch cluster gets backed up for whatever reason. I don't see a control to limit the backlog in the okhttp dispatcher, so we might have to do something here..

here are notes from @Logic-32 #1332 (comment)

We are using Brave with a URLConnectionSender to send the data. I'm surprised that no tool could handle more than 10k spans in your workshop; we handle it with minimal problems on only one zipkin-server, no cluster. The only problem is that the ES_READ_TIMEOUT is set to 60s and it sometimes takes more than 60s to write the data out. So, the data does get written but zipkin erroneously reports them as dropped. But our specific situation that got us into that state was having a 10s messageTimeout and a tight loop that hit the database; both of which have been remedied.

Quick review of the documentation indicates that the max connections only affects the size of the pool they use; which probably means everything else would be queued and backlogged.

Again, not knowing OkHttp that well, I wonder if you can inspect the Dispatcher queue and cancel anything in the queue if you exceed a certain threshold?
Ref: https://gist.github.com/RyanRamchandar/64c5863838940ec67f03

codefromthecrypt · 2017-10-07T02:15:08Z

One way we could do it is to drop the incoming requests when we notice dispatcher.runningCallsCount() is at a specific threshold, maybe easy as our in-flight limit (default 64)!

This trusts that auto-cancelation based on connect/read/write timeouts work (which defaults to 10s). If there's a possibility of stale in-flight requests otherwise, I guess we'd need something to watchdog in-flight requests to kill them if they are over the various timeouts.

cc the usual okhttp folks
@yschimke @swankjesse @JakeWharton

yschimke · 2017-10-07T05:01:18Z

From the Gist but also knowing the EventListener interface, is this something that could be prototyped first as an external library. Seems like all the hooks are there to come up with something pretty smart.

swankjesse · 2017-10-07T08:35:06Z

Another strategy: have your own queue that sits in front of the dispatcher and does its own prioritization and enqueueing.

Getting signaled when dispatcher’s size changes may be tricky.

yschimke · 2017-10-07T09:07:24Z

This sort of thing is hard to get right without knowing how the client and server(s) are handling things. Will OkHttpClient ever make the optimal decision? e.g. is it worth submitting a queued call shortly before timeout, and therefore newer calls should be prioritised over older ones.

codefromthecrypt · 2017-10-07T09:17:25Z

There are two main cases in ES anyway, the server fails not nicely (ex out of disk) and the server fails nicely (response that says backoff). The formal ES client has a backoff strategy for the latter. The former can still happen and did in the issue leading to this one. Regardless, one simplification we can make here is that we can choose to drop new requests or drop in-flight ones on overload. In either case, and we don't need to buffer in expectation of a transient failure. For example, first priority is not dying (OOM), though later more advanced things could be done.

Logic-32 · 2017-10-09T19:46:39Z

@yschimke, looking at the EventListener interface, I'm not quite sure it'd be able to accomplish what we'd need by it self. Specifically, this line from the documentation makes me a bit uneasy:

All event methods must execute fast, without external locking, cannot throw exceptions, attempt to mutate the event parameters, or be reentrant back into the client.

The inability for a listener to modify the client being the phrase to note there. Maybe an EventListener in conjunction with a queue handling mechanism in front of OkHttp would work but that sounds like a lot of hoops to jump through and equally as many places for something to go wrong.

@swankjesse, what would the benefit of putting a queue in front of OkHttp be (aside from what I mentioned above)? Would querying the Dispatcher directly be inappropriate in some way or are you just suggesting that an extra abstraction might lead to a cleaner implementation?

@adriancole, the StorageComponent code is a bit deep and I haven't managed to get through it all yet, particularly to the ES client, but for reference, there was another situation where we hit an OOM: when one of our ES nodes went down. I wasn't able to identify and exact order of events that caused the issue but for some reason after the node (1 out of 3 of them) went offline we had a backup of requests and spiked to an OOM. That indicates to me that I'm guessing the server was not failing nicely and triggering the backoff strategy you mentioned. Again, not sure what other factors were at play but it's another scenario to consider at the very least.

As for the simplification: that sounds ideal to me as long as proper metrics are maintained with respect to number of spans dropped ;)

codefromthecrypt · 2017-10-10T00:49:59Z

Honestly, I think we should start with dropping on backlog and go from there. The test is easy.. crap out a cluster and throw traffic at zipkin. uncrap the cluster and the drop metrics should stop.

codefromthecrypt · 2017-10-10T00:53:51Z

downside of dropping on backlog (ex ready queue > 0 or some figure defaulting to zero) is that we will still have a by default 64 requests in flight on a crap cluster. however, the change to improve that (ex coordinate on http response) could layer a filter or other code to throttle back. If 64 requests OOM the server anyway, you can change that to a lower value today anyway. At any rate a healthy backoff signal isn't always present, so regardless we need a knob like drop on backlog as otherwise whoever's queue it is will lead to OOM.

In the past, delayed or otherwise unhealthy elasticsearch clusters could create a backlog leading to a OOM arising from the http dispatcher ready queue. This chooses to prevent a ready queue instead. This means we drop spans when the backend isn't responding instead of crashing the server. Fixes #1760

codefromthecrypt · 2017-10-10T12:34:40Z

#1765 I am soak testing this, but the server already survives a lot longer. basically, set ES_MAX_REQUESTS=2 and throw lots traffic at it. Before this change, it would reliably OOM. It is harder to OOM if you have the default max requests of 64 :P

In the past, delayed or otherwise unhealthy elasticsearch clusters could create a backlog leading to a OOM arising from the http dispatcher ready queue. This chooses to prevent a ready queue instead. This means we drop spans when the backend isn't responding instead of crashing the server. Fixes #1760

codefromthecrypt · 2017-10-10T14:11:56Z

let's try simple first. I've put a semaphore in, and zipkin 2.2 has some nice things for grafana etc. give it a try and let's iterate from there..

here's an example of a forced surge and recovery though I'm sure we can kill the server if we try hard enough.

Logic-32 · 2017-10-13T23:30:42Z

Thanks! As previously mentioned, transition to zipkin 2 will be a little involved for us but I will report back when I can!

Hardest part with this is that the death is random. So who knows if we'll find the random event again ;)

codefromthecrypt · 2017-10-14T01:31:45Z

Hardest part with this is that the death is random. So who knows if we'll find the random event again ;)

well thanks for dying earlier, as I think the code is a bit safer now, even if not bulletproof. cheers!

codefromthecrypt · 2019-05-08T02:14:58Z

#2502 now includes test instructions please give a try

In the past, delayed or otherwise unhealthy elasticsearch clusters could create a backlog leading to a OOM arising from the http dispatcher ready queue. This chooses to prevent a ready queue instead. This means we drop spans when the backend isn't responding instead of crashing the server. Fixes openzipkin#1760

codefromthecrypt mentioned this issue Oct 7, 2017

OutOfMemoryError encountered when reading from Kafka #1332

Closed

codefromthecrypt mentioned this issue Oct 10, 2017

Prevents backlog when writing to Elasticsearch #1765

Merged

codefromthecrypt closed this as completed in #1765 Oct 10, 2017

Logic-32 mentioned this issue Jul 6, 2018

Track number of active requests #2125

Closed

This was referenced Apr 2, 2019

Spans lost with kafka and elasticsearch (over capacity) #2023

Open

Buffer some requests in order to reduce "over capacity" errors without also killing Elasticsearch #2481

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate how to limit backlog on Elasticsearch collector #1760

Investigate how to limit backlog on Elasticsearch collector #1760

codefromthecrypt commented Oct 7, 2017

codefromthecrypt commented Oct 7, 2017 •

edited

yschimke commented Oct 7, 2017 •

edited

swankjesse commented Oct 7, 2017

yschimke commented Oct 7, 2017 •

edited

codefromthecrypt commented Oct 7, 2017 via email

Logic-32 commented Oct 9, 2017

codefromthecrypt commented Oct 10, 2017 via email

codefromthecrypt commented Oct 10, 2017 via email

codefromthecrypt commented Oct 10, 2017

codefromthecrypt commented Oct 10, 2017

Logic-32 commented Oct 13, 2017

codefromthecrypt commented Oct 14, 2017 via email

codefromthecrypt commented May 8, 2019

Investigate how to limit backlog on Elasticsearch collector #1760

Investigate how to limit backlog on Elasticsearch collector #1760

Comments

codefromthecrypt commented Oct 7, 2017

codefromthecrypt commented Oct 7, 2017 • edited

yschimke commented Oct 7, 2017 • edited

swankjesse commented Oct 7, 2017

yschimke commented Oct 7, 2017 • edited

codefromthecrypt commented Oct 7, 2017 via email

Logic-32 commented Oct 9, 2017

codefromthecrypt commented Oct 10, 2017 via email

codefromthecrypt commented Oct 10, 2017 via email

codefromthecrypt commented Oct 10, 2017

codefromthecrypt commented Oct 10, 2017

Logic-32 commented Oct 13, 2017

codefromthecrypt commented Oct 14, 2017 via email

codefromthecrypt commented May 8, 2019

codefromthecrypt commented Oct 7, 2017 •

edited

yschimke commented Oct 7, 2017 •

edited

yschimke commented Oct 7, 2017 •

edited