
Corrupted gelf messages with TCP sender #96

Closed
wants to merge 1 commit into from

Conversation

@koeff commented Oct 5, 2016

Under high load, some log messages are corrupted or lost. I wrote a reproducible test.

Environment: java version "1.8.0_102"; logstash-gelf-1.10.0; logback 1.1.7; graylog.

I'm using logback with the logstash-gelf-1.10.0 appender, configured to use the TCP protocol.
The problem is still reproducible on logstash-gelf-1.11.0-SNAPSHOT.

Logback.xml appender config:

<appender name="GELF" class="biz.paluch.logging.gelf.logback.GelfLogbackAppender">
    <host>tcp:${graylogHost}</host>
    <port>12201</port>
    <version>1.1</version>
    <facility>${appName}</facility>

    <extractStackTrace>true</extractStackTrace>
    <filterStackTrace>true</filterStackTrace>

    <timestampPattern>yyyy-MM-dd_HH:mm:ss.SSS</timestampPattern>
    <maximumMessageSize>8192</maximumMessageSize>
</appender>

GelfTcpSender flushing gelf message buffer

maven source plugin
@@ -208,6 +208,19 @@
</executions>
</plugin>

<plugin>
mp911de (Owner):

Any particular reason for this change? maven-source-plugin is pulled in through the parent pom and executed on release builds.

koeff (Author):

No, there is no reason for this change. It may be ignored.

@mp911de (Owner) commented Oct 5, 2016

Thanks for your PR. Could you help me understand the issue? GelfTCPSender was synchronized once, but that led to performance issues for some users. I currently do not understand why synchronization would solve the issue: SocketChannelImpl synchronizes writes internally (see SocketChannelImpl.write).

Reviewing my code, I found that AbstractNioSender.channel is not volatile, so synchronization might resolve visibility issues; but adding volatile to AbstractNioSender.channel is required anyway to resolve possible visibility issues with the other senders.
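For illustration, a minimal sketch of the visibility concern being discussed, assuming a field named channel on the sender class (the class and accessor names below are placeholders, not the project's actual code):

```java
import java.nio.channels.SocketChannel;

// Hypothetical sketch, not the library's code: without volatile, a channel that
// one thread replaces (e.g. on reconnect) may not become visible to other logging threads.
abstract class AbstractNioSenderSketch {

    // volatile publishes the latest channel reference to all writer threads
    private volatile SocketChannel channel;

    protected SocketChannel channel() {
        return channel;
    }

    protected void setChannel(SocketChannel channel) {
        this.channel = channel;
    }
}
```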

I'm also not sure whether SO_SNDBUF might be an issue. If the send buffer is full, it won't accept any data. I'm not sure what happens in such a case; the socket could either block or throw an IOException. If it's the exception, then the socket gets closed and reopened, which could cause packet loss.

@koeff (Author) commented Oct 5, 2016

Ok, I'll try to explain the problem (sorry for my English).
I used logstash-gelf in one project, and sometimes there are errors in Graylog's logs:
2016-09-22T10:24:10.039+03:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=56e8539a-8089-11e6-ae01-002590da0246, journalOffset=-9223372036854775808, codec=gelf, payloadSize=1298, timestamp=2016-09-22T05:56:43.832Z, remoteAddress=/192.168.1.12:43523} com.fasterxml.jackson.core.JsonParseException: Unexpected character ('_' (code 95)): was expecting a colon to separate field name and value at [Source: {"_Severity":"INFO","short_message":"...

So I did some research: I sniffed the traffic from the application (which uses logstash-gelf) and found that some GELF messages are broken. This occurs when the application sends a lot of messages in a short time. Usually it looks like one incomplete GELF message followed by the next complete GELF message, like this:
{"_Severity":"INFO","short_messa{"_Severity":"INFO","short_message":"...<complete gelf message>}

Yes, SocketChannelImpl synchronizes writes internally. But GelfTCPSender uses the socketChannel in non-blocking mode (socketChannel.configureBlocking(false)). The Javadoc of the write method says: "...Some types of channels, depending upon their state, may write only some of the bytes or possibly none at all. A socket channel in non-blocking mode, for example, cannot write any more bytes than are free in the socket's output buffer."
So socketChannel.write(byteBuffer) in non-blocking mode does not guarantee that all buffer data will be written.
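As an illustration of that point, a minimal sketch (illustrative names, not the sender's actual code) of a write that checks the buffer's remaining bytes and retries until everything has been handed to the socket:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class NonBlockingWriteSketch {

    // In non-blocking mode write() may transfer only part of the buffer (or nothing
    // at all when the socket send buffer is full), so the remaining bytes must be
    // retried. A real implementation would back off or use a Selector instead of spinning.
    static void writeFully(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer); // returns the number of bytes actually written
        }
    }
}
```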

Writing to the socketChannel in a loop doesn't solve the problem of multiple log-writer threads. Example: one thread writes a piece of its GELF message, then another thread writes a piece of its own GELF message, and then the first thread finishes writing its message. The two messages are blended together and the stream of GELF messages breaks. To separate multiple GELF writers I used synchronization: we can't start writing the next GELF message to the socket before the previous GELF message has been written completely.
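A sketch of that synchronization idea, under the same illustrative assumptions as above: one thread writes a complete GELF message before the next thread may start, so partial writes from different threads cannot interleave on the wire.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class SerializedGelfWriterSketch {

    private final Object writeLock = new Object();
    private final SocketChannel channel;

    SerializedGelfWriterSketch(SocketChannel channel) {
        this.channel = channel;
    }

    // The lock guarantees that whole messages reach the socket one after another;
    // inside the lock the remaining-bytes loop handles partial non-blocking writes.
    void send(ByteBuffer gelfMessage) throws IOException {
        synchronized (writeLock) {
            while (gelfMessage.hasRemaining()) {
                channel.write(gelfMessage);
            }
        }
    }
}
```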

If the send buffer is full it won't accept any data. I'm not sure what happens in such case, the socket could either block or throw an IOException

In non-blocking mode, nothing happens: we write as many bytes as possible, and when the send buffer fills up, GelfTCPSender doesn't try to write the remaining data in the buffer.

This PR contains one possible fix for the issue; most importantly, it carries the idea. Maybe there is a better solution.

@mp911de (Owner) commented Oct 5, 2016

Awesome, thanks for the explanation. Don't worry about your English, it's really good.

So the solution is to make sure that all bytes from the buffer are written and the buffer does not contain any remaining bytes? Message loss could then also happen if the buffer is smaller than the message (say, a 100-byte output buffer and a 110-byte message). I will investigate this issue further to find out how to fix all affected senders.

@mp911de mp911de added the type: bug A general bug label Oct 5, 2016
@koeff (Author) commented Oct 5, 2016

Yes.
The first thing is checking (in non-blocking mode) that all data has been written to the socket and the byteBuffer does not contain any remaining bytes.

The second thing is synchronizing multiple threads so that only one thread writes a GELF message at a time. Other threads should not start writing to the socket while the first thread has not finished writing its GELF message.

Loss of messages due to the buffer size is at least predictable behavior. We just need to keep this case in mind.

@koeff (Author) commented Oct 6, 2016

Hello!
I tried disabling the socketChannel's non-blocking mode.
Then there is no need to check the byteBuffer's remaining bytes while writing to the socket, and no need to synchronize multiple threads that write to the same socket.
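A minimal sketch of that alternative (illustrative only, not the branch's actual code): in blocking mode, a SocketChannel write() returns only after the whole remaining buffer has been written, and SocketChannelImpl already serializes concurrent writes internally.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class BlockingWriteSketch {

    static SocketChannel open(String host, int port) throws IOException {
        SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, port));
        channel.configureBlocking(true); // the default, made explicit here
        return channel;
    }

    // Blocks until all remaining bytes of the GELF message are written,
    // so no remaining-bytes loop and no extra synchronization are needed.
    static void send(SocketChannel channel, ByteBuffer gelfMessage) throws IOException {
        channel.write(gelfMessage);
    }
}
```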

Here is the new branch (socketSendBufferOverflowTest remained the same): https://github.com/koeff/logstash-gelf/tree/send-buffer-overflow-2

Are there any advantages to using a non-blocking socket channel here? We don't use selectors here.

@mp911de (Owner) commented Oct 6, 2016

There's a reason for non-blocking mode: a SocketChannel may get disconnected without that being visible to the application yet (see #88). The sender invokes a read() operation to either retrieve data from the socket buffer or perform a socket read. Broken sockets throw an IOException, which is the signal to reconnect. In non-blocking mode, a read() with no data does not block.
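For illustration, a sketch of the kind of liveness probe being described (class and method names are assumptions, not the library's exact code): in non-blocking mode the read() returns immediately when no data is pending, while a broken socket throws an IOException and signals a reconnect.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class ConnectionCheckSketch {

    private final ByteBuffer scratch = ByteBuffer.allocate(64);

    // Returns false when the socket is broken or the peer has closed the connection,
    // which is the signal to reconnect. With a non-blocking channel this never blocks.
    boolean isConnectionAlive(SocketChannel channel) {
        try {
            scratch.clear();
            int read = channel.read(scratch); // 0 = no data pending, -1 = peer closed
            return read >= 0;
        } catch (IOException e) {
            return false;
        }
    }
}
```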

@mp911de (Owner) commented Oct 6, 2016

I took a look at your work and adopted it in ecddbfd. Could you give the changes a try? If so, I'd merge the changes to master.

@mp911de mp911de added this to the logstash-gelf 1.11.0 milestone Oct 6, 2016
@koeff (Author) commented Oct 6, 2016

Reviewed. This commit fixes message corruption when there is only one log writer.
GELF messages are still corrupted when there are multiple log writers.

mp911de added a commit that referenced this pull request Oct 6, 2016
A high write volume can cause packet loss when using the TCP or UDP senders. This is because of saturated send buffers. NIO channels don't block/continue writing if send buffers are full but return from the call. Inspecting the remaining buffer size is a good measure to determine whether the buffer was written entirely or whether bytes to write are left. Add synchronization to prevent GELF message interleaving.

TCP and UDP senders now retry writes until the message buffer is written entirely to the Channel (send buffer).

See also PR #96.
@mp911de (Owner) commented Oct 6, 2016

Thanks a lot for your kind help to discover and fix multiple issues. I'm closing this PR in favor of a slightly different fix.

@mp911de mp911de closed this Oct 6, 2016
@koeff koeff deleted the send-buffer-overflow branch October 28, 2016 10:26