
Corrupted gelf messages with TCP sender #96

Closed
wants to merge 1 commit into from

Conversation

@koeff commented Oct 5, 2016

Under high load, some log messages are corrupted or lost. I wrote a reproducible test.

Environment: java version "1.8.0_102"; logstash-gelf-1.10.0; logback 1.1.7; graylog.

I'm using logback with the logstash-gelf-1.10.0 appender, configured to use the TCP protocol.
The problem is still reproducible on logstash-gelf-1.11.0-SNAPSHOT.

Logback.xml appender config:

<appender name="GELF" class="biz.paluch.logging.gelf.logback.GelfLogbackAppender">
    <host>tcp:${graylogHost}</host>
    <port>12201</port>
    <version>1.1</version>
    <facility>${appName}</facility>

    <extractStackTrace>true</extractStackTrace>
    <filterStackTrace>true</filterStackTrace>

    <timestampPattern>yyyy-MM-dd_HH:mm:ss.SSS</timestampPattern>
    <maximumMessageSize>8192</maximumMessageSize>
</appender>

GelfTcpSender flushing gelf message buffer

maven source plugin
@@ -208,6 +208,19 @@
</executions>
</plugin>

<plugin>
mp911de (Owner):

Any particular reason for this change? maven-source-plugin is pulled in through the parent pom and executed on release builds.

koeff (Author):

No, there is no reason for this change. It may be ignored.

@mp911de (Owner) commented Oct 5, 2016

Thanks for your PR. Could you help me understand the issue? GelfTCPSender was synchronized once, but that led to performance issues for some users. I currently do not understand why synchronization would solve the issue: SocketChannelImpl synchronizes writes internally (see SocketChannelImpl.write).

Reviewing my code, I found that AbstractNioSender.channel is not volatile, so synchronization might resolve visibility issues; but adding volatile to AbstractNioSender.channel is required anyway to resolve possible visibility issues with the other senders.
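For illustration, a minimal sketch of the visibility concern being discussed, assuming a field named channel on the sender class (the class and accessor names below are placeholders, not the project's actual code):

```java
import java.nio.channels.SocketChannel;

// Hypothetical sketch, not the library's code: without volatile, a channel that
// one thread replaces (e.g. on reconnect) may not become visible to other logging threads.
abstract class AbstractNioSenderSketch {

    // volatile publishes the latest channel reference to all writer threads
    private volatile SocketChannel channel;

    protected SocketChannel channel() {
        return channel;
    }

    protected void setChannel(SocketChannel channel) {
        this.channel = channel;
    }
}
```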

I'm also not sure whether SO_SNDBUF might be an issue. If the send buffer is full, it won't accept any data. I'm not sure what happens in such a case; the socket could either block or throw an IOException. If it's the exception, then the socket gets closed and reopened, which could cause packet loss.

@koeff (Author) commented Oct 5, 2016

Ok, I'll try to explain the problem (sorry for my English).
I used logstash-gelf in one project, and sometimes there are errors in Graylog's logs:
2016-09-22T10:24:10.039+03:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=56e8539a-8089-11e6-ae01-002590da0246, journalOffset=-9223372036854775808, codec=gelf, payloadSize=1298, timestamp=2016-09-22T05:56:43.832Z, remoteAddress=/192.168.1.12:43523} com.fasterxml.jackson.core.JsonParseException: Unexpected character ('_' (code 95)): was expecting a colon to separate field name and value at [Source: {"_Severity":"INFO","short_message":"...

So I did some research: I sniffed the traffic from the application (which uses logstash-gelf) and found that some GELF messages are broken. This occurs when the application sends a lot of messages in a short time. Usually it looks like one incomplete GELF message followed by the next complete GELF message, like this:
{"_Severity":"INFO","short_messa{"_Severity":"INFO","short_message":"...<complete gelf message>}

Yes, SocketChannelImpl synchronizes writes internally. But GelfTCPSender uses the socketChannel in non-blocking mode (socketChannel.configureBlocking(false)). The Javadoc of the write method says: "...Some types of channels, depending upon their state, may write only some of the bytes or possibly none at all. A socket channel in non-blocking mode, for example, cannot write any more bytes than are free in the socket's output buffer."
So socketChannel.write(byteBuffer) in non-blocking mode does not guarantee that all buffer data will be written.
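As an illustration of that point, a minimal sketch (illustrative names, not the sender's actual code) of a write that checks the buffer's remaining bytes and retries until everything has been handed to the socket:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class NonBlockingWriteSketch {

    // In non-blocking mode write() may transfer only part of the buffer (or nothing
    // at all when the socket send buffer is full), so the remaining bytes must be
    // retried. A real implementation would back off or use a Selector instead of spinning.
    static void writeFully(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer); // returns the number of bytes actually written
        }
    }
}
```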

Writing to the socketChannel in a loop doesn't solve the problem of multiple log-writer threads. Example: one thread writes a piece of its GELF message, then another thread writes a piece of its own GELF message, and then the first thread finishes writing its message. The two messages are blended together and the stream of GELF messages breaks. To separate multiple GELF writers I used synchronization: we can't start writing the next GELF message to the socket before the previous GELF message has been written completely.
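A sketch of that synchronization idea, under the same illustrative assumptions as above: one thread writes a complete GELF message before the next thread may start, so partial writes from different threads cannot interleave on the wire.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class SerializedGelfWriterSketch {

    private final Object writeLock = new Object();
    private final SocketChannel channel;

    SerializedGelfWriterSketch(SocketChannel channel) {
        this.channel = channel;
    }

    // The lock guarantees that whole messages reach the socket one after another;
    // inside the lock the remaining-bytes loop handles partial non-blocking writes.
    void send(ByteBuffer gelfMessage) throws IOException {
        synchronized (writeLock) {
            while (gelfMessage.hasRemaining()) {
                channel.write(gelfMessage);
            }
        }
    }
}
```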

If the send buffer is full it won't accept any data. I'm not sure what happens in such case, the socket could either block or throw an IOException

In non-blocking mode, nothing happens: we write as many bytes as possible, and when the send buffer fills up, GelfTCPSender doesn't try to write the remaining data in the buffer.

This PR contains one possible fix for the issue; most importantly, it carries the idea. Maybe there is a better solution.

@mp911de (Owner) commented Oct 5, 2016

Awesome, thanks for the explanation. Don't worry about your English, it's really good.

So the solution is to make sure that all bytes from the buffer are written and the buffer does not contain any remaining bytes? Message loss could then also happen if the buffer is smaller than the message (say, a 100-byte output buffer and a 110-byte message). I will investigate this issue further to find out how to fix all affected senders.

@mp911de mp911de added the type: bug A general bug label Oct 5, 2016
@koeff (Author) commented Oct 5, 2016

Yes.
The first thing is checking (in non-blocking mode) that all data has been written to the socket and the byteBuffer does not contain any remaining bytes.

The second thing is synchronizing multiple threads so that only one thread writes a GELF message at a time. Other threads should not start writing to the socket while the first thread has not finished writing its GELF message.

Loss of messages due to the buffer size is at least predictable behavior. We just need to keep this case in mind.

@koeff (Author) commented Oct 6, 2016

Hello!
I tried disabling the socketChannel's non-blocking mode.
Then there is no need to check the byteBuffer's remaining bytes while writing to the socket, and no need to synchronize multiple threads that write to the same socket.
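A minimal sketch of that alternative (illustrative only, not the branch's actual code): in blocking mode, a SocketChannel write() returns only after the whole remaining buffer has been written, and SocketChannelImpl already serializes concurrent writes internally.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class BlockingWriteSketch {

    static SocketChannel open(String host, int port) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port));
        channel.configureBlocking(true); // the default, made explicit here
        return channel;
    }

    // Blocks until all remaining bytes of the GELF message are written,
    // so no remaining-bytes loop and no extra synchronization are needed.
    static void send(SocketChannel channel, ByteBuffer gelfMessage) throws IOException {
        channel.write(gelfMessage);
    }
}
```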

Here is the new branch (socketSendBufferOverflowTest remained the same): https://github.com/koeff/logstash-gelf/tree/send-buffer-overflow-2

Are there any advantages to using a non-blocking socket channel here? We don't use selectors here.

@mp911de (Owner) commented Oct 6, 2016

There's a reason for non-blocking mode: a SocketChannel may get disconnected without that being visible to the application yet (see #88). The sender invokes a read() operation to either retrieve data from the socket buffer or perform a socket read. Broken sockets throw an IOException, which is the signal to reconnect. In non-blocking mode, a read() with no data does not block.
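For illustration, a sketch of the kind of liveness probe being described (class and method names are assumptions, not the library's exact code): in non-blocking mode the read() returns immediately when no data is pending, while a broken socket throws an IOException and signals a reconnect.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class ConnectionCheckSketch {

    private final ByteBuffer scratch = ByteBuffer.allocate(64);

    // Returns false when the socket is broken or the peer has closed the connection,
    // which is the signal to reconnect. With a non-blocking channel this never blocks.
    boolean isConnectionAlive(SocketChannel channel) {
        try {
            scratch.clear();
            int read = channel.read(scratch); // 0 = no data pending, -1 = peer closed
            return read >= 0;
        } catch (IOException e) {
            return false;
        }
    }
}
```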

@mp911de (Owner) commented Oct 6, 2016

I took a look at your work and adopted it in ecddbfd. Could you give the changes a try? If so, I'd merge the changes to master.

@mp911de mp911de added this to the logstash-gelf 1.11.0 milestone Oct 6, 2016
@koeff (Author) commented Oct 6, 2016

Reviewed. This commit fixes message corruption when there is only one log writer.
GELF messages are still corrupted when there are multiple log writers.

mp911de added a commit that referenced this pull request Oct 6, 2016
A high write volume can cause packet loss when using the TCP or UDP senders. This is because of saturated send buffers. NIO channels don't block/continue writing if send buffers are full but return from the call. Inspecting the remaining buffer size is a good measure to determine whether the buffer was written entirely or whether bytes to write are left. Add synchronization to prevent GELF message interleaving.

TCP and UDP senders now retry writes until the message buffer is written entirely to the Channel (send buffer).

See also PR #96.
@mp911de (Owner) commented Oct 6, 2016

Thanks a lot for your kind help to discover and fix multiple issues. I'm closing this PR in favor of a slightly different fix.

@mp911de mp911de closed this Oct 6, 2016
@koeff koeff deleted the send-buffer-overflow branch October 28, 2016 10:26