Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogstashTcpSocketAppender may loose events in its onEvent() method #329

Closed
brenuart opened this issue Apr 10, 2019 · 5 comments · Fixed by #649
Closed

LogstashTcpSocketAppender may loose events in its onEvent() method #329

brenuart opened this issue Apr 10, 2019 · 5 comments · Fixed by #649
Milestone

Comments

@brenuart
Copy link
Collaborator

It seems the LogstashTcpSocketAppender may loose events in its onEvent() method if it fails to send it for 5 times in a row. It may for instance fail to send the event because the connection has been dropped by the remote peer or the encoder failed to encode it.

In case of a poison event (encoder failed to process it), then it is indeed safer to discard the event and proceed with the next one. In this case there is no need to retry 5 times - it can be discarded immediately. And has a comment in the code says, there is no need to reopen the socket as it won't help in this case...

But when the failure is caused by a broken connection, I think the appender should not drop the event but should rather keep trying to send it until it succeed. Think about a scenario where the remote peer drops the connection immediately after it is established. In this case, the onSend() method will quickly iterate over the 5 allowed attempts and drop the message!

What do you think ?

@philsttr
Copy link
Collaborator

Ok, so, we would need a way to categorize exceptions into two types:
A. exceptions where we should reopen the connection and re-send event
B. exceptions where we should drop the event (probably should still reopen the connection for future events in case the event was partially-sent, and the stream corrupted)

How do you propose categorizing these two exception types?

Retrying indefinitely will help in a short term blip. By "short term" I mean an amount of time that is short enough so that the ringbuffer does not fill completely with log events. If the connection problems last long enough so that the ring buffer fills up completely, then log events will be dropped when they are attempted to be inserted in the ring buffer.

So, for long term connection problems, changing the behavior won't really help, since events will get dropped anyway. It's just a matter of which events get dropped.

However, for short term connection problems, changing the behavior will be an improvement.

@brenuart
Copy link
Collaborator Author

brenuart commented Apr 16, 2019 via email

@brenuart
Copy link
Collaborator Author

brenuart commented Jun 25, 2019

I'm back, was a very long week... ;-)

Here is an idea:

  • wrap the socket output stream with a wrapper whose purpose is to detect if the write operation threw an exception.
  • in catch section of TcpSendingEventHandler, check whether the wrapper detected an exception. If it did, we know the problem is in the "network" layer and we trigger a reconnection. If it did not, then the exception was thrown by the encoder and there is nothing much we can do about it besides dropping the "poison" event.

We can then safely remove the MAX_REPEAT_WRITE_ATTEMPTS from the loop and be sure that nothing but poison events will be dropped on "this side" of the buffer.

@brenuart
Copy link
Collaborator Author

brenuart commented Jul 19, 2019

Want me to submit a PR with this approach ?

@philsttr
Copy link
Collaborator

The approach seems reasonable to me. And yes, PRs are always welcome. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants