When the pipeline is stuck, the lumberjack input can make logstash run out of memory #10

Closed
ph opened this issue Apr 23, 2015 · 7 comments

@ph
Contributor

ph commented Apr 23, 2015

As discussed in elastic/logstash#3003.

When back pressure is applied to the pipeline all the way up to the lumberjack input, the connection threads block. On the producer side, logstash-forwarder (LSF) never receives an ack message for the blocked payload and assumes the connection has timed out.

LSF's behavior is to reconnect on timeout and try to resend the unacknowledged frames to logstash. The input accepts this new connection but blocks on the queue again. LSF retries forever, and logstash eventually goes OOM, buckling under the number of connection attempts.

The first goal here is to implement a threadpool to limit the number of connections an input can create, and to refuse any new connection when we don't have any resources left.
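The idea of capping connections and refusing the extras can be sketched roughly as follows. This is only an illustration in Python, not the plugin's code: `ConnectionLimiter`, `handle_connection`, and the `max_connections` parameter are hypothetical names, and a real input would close the socket where the sketch returns `False`.

```python
import threading


class ConnectionLimiter:
    """Caps concurrent connections; refuses instead of queueing when full.

    Hypothetical helper for illustration, not part of the lumberjack input.
    """

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def handle_connection(limiter, process):
    """Run process() if a slot is free; return False (refused) otherwise."""
    if not limiter.try_acquire():
        return False  # a real server would close the client socket here
    try:
        process()
        return True
    finally:
        # Always give the slot back, even if process() raised.
        limiter.release()
```

Refusing up front means a reconnecting producer costs the server almost nothing, instead of adding one more blocked thread per retry.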

@ph
Contributor Author

ph commented Apr 23, 2015

Another solution we could implement at the plugin level is to keep a small buffered queue of events inside the lumberjack plugin that supports a timeout on the lock, so this queue could act as a small broker between the input and the SizedQueue.
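A minimal sketch of such a broker, written in Python rather than Ruby and using the standard `queue.Queue` as a stand-in for the SizedQueue. `BufferedBroker`, `offer`, and `drain_into` are hypothetical names chosen for the example, and the buffer size and timeout are arbitrary.

```python
import queue


class BufferedBroker:
    """Small bounded buffer whose writes give up after a timeout.

    Hypothetical illustration; not the lumberjack plugin's implementation.
    """

    def __init__(self, maxsize, timeout):
        self._buffer = queue.Queue(maxsize=maxsize)
        self._timeout = timeout

    def offer(self, event):
        """Try to buffer an event; return False if the buffer stayed full."""
        try:
            self._buffer.put(event, timeout=self._timeout)
            return True
        except queue.Full:
            # The connection thread gets control back instead of blocking forever.
            return False

    def drain_into(self, downstream):
        """Move everything currently buffered into the downstream queue."""
        moved = 0
        while True:
            try:
                event = self._buffer.get_nowait()
            except queue.Empty:
                return moved
            downstream.put(event)  # may still block if downstream is bounded
            moved += 1
```

When `offer` returns `False`, the input can decide what to do (drop the connection, withhold the ack) rather than hanging inside the queue lock.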

@ph
Contributor Author

ph commented Apr 23, 2015

@colinsurprenant's new persisted queue could also reduce that pressure on the input side.

Do you see any problems with adding some sort of timeout mechanism to your persisted queue?
This could help to actually apply the back pressure on the producer side for networked inputs.

@driskell

In Log Courier I implemented SizedQueue with timeouts, and partial ACK to prevent nearly all timeouts. Works extremely well.

Though I think partial ack is not going to be backwards compatible without versioning of some sort, because the forwarder does not validate acks.
elastic/logstash-forwarder#180 is an old patch for partial ack that may serve as a guide, but the Timeout::timeout caused more trouble, to be fair (it launches a thread and could race).

Using the persisted queue with timeouts would be great I think.

Thought I'd let you know as it might help with discovery :)

@ph
Contributor Author

ph commented Apr 23, 2015

I still believe a threadpool and timeout would be great to have. Thanks @driskell for the reference.

@driskell

Sure I was not attempting to change your plans if it came across that way! They are great.

Was pointing out that timeout has been worked on before in case it helps as reference. Timeout is biggest win I think (it was for me) and its natural progression (with even bigger win) is partial ack - means nobody has to mess around with timeout settings.

Love the work you guys do. I'll leave you to get on with it however you decide 👍

@ph
Contributor Author

ph commented Apr 23, 2015

@driskell I might have been a bit direct in my last reply. I am sorry.

Let me do a bit of explanation.

The threadpool/timeout will help a bit with the current OOM problem; I agree this is not the golden solution.
But before making any major changes to the current protocol, we need to do a few things:

  1. Extract the lumberjack gem outside of the LSF, so we can iterate on it more quickly.
  2. Add more tests to it; currently this gem is lacking in testing and it is a bit hard to add tests in some places.
  3. Make it more resilient to common TCP problems.
  4. Add more logging and visibility in what lumberjack is doing.
  5. Add resiliency on the protocol layer, that could mean keep-alive, partial ack, etc. (I will surely check your PR/comments and we can probably iterate on it)

All of these things are meant to improve the user experience and resiliency of the whole stack.

@ph
Contributor Author

ph commented Aug 21, 2015

SizedQueue and circuit breaker implemented in the lumberjack input.
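For readers unfamiliar with the circuit breaker pattern, here is a rough Python sketch of the general technique. It is not the code that landed in the lumberjack input: `CircuitBreaker`, `CircuitOpen`, `error_threshold`, and `reset_after` are hypothetical names, and the thresholds are arbitrary.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and calls are being refused."""


class CircuitBreaker:
    """Opens after repeated failures and refuses calls until a cool-down passes.

    Generic illustration of the pattern; names and defaults are assumptions.
    """

    def __init__(self, error_threshold, reset_after, clock=time.monotonic):
        self._threshold = error_threshold
        self._reset_after = reset_after
        self._clock = clock  # injectable so the cool-down can be tested
        self._errors = 0
        self._opened_at = None

    def is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after:
            # Cool-down elapsed: close the breaker and allow traffic again.
            self._opened_at = None
            self._errors = 0
            return False
        return True

    def call(self, fn):
        if self.is_open():
            raise CircuitOpen("refusing work while the pipeline is blocked")
        try:
            result = fn()
        except Exception:
            self._errors += 1
            if self._errors >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._errors = 0  # a success resets the consecutive-failure count
        return result
```

In the scenario from this issue, `fn` would be the attempt to push events into the queue with a timeout, and an open breaker would be the signal to stop accepting new connections.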

ph closed this as completed Aug 21, 2015