
Reduce memory allocations by writing directly into the output stream (#461) #472

Merged
merged 7 commits into from
Jul 5, 2021

Conversation

brenuart
Collaborator

@brenuart brenuart commented Jan 5, 2021

Reduce GC pressure by writing directly into the output stream instead of returning a byte array.

Introduce a new (internal) StreamingEncoder interface to be implemented by Encoders that support writing directly into the output stream instead of returning their results as a byte array.

Update both the AbstractLogstashTcpSocketAppender and the CompositeJsonEncoder to support this new interface. This should hopefully reduce the number of short-lived byte arrays created for each log event.

Rework the test cases to use actual instances instead of mocks (required to preserve state during the test; otherwise conditional operations like start/isStarted do not work as expected).

See #461 for more information.
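To make the idea concrete, here is a minimal sketch of what such a streaming contract could look like. The names and the trivial implementation are illustrative only, not the PR's exact code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the StreamingEncoder contract described above:
// instead of returning a byte[] per event, the encoder writes straight
// into the appender's output stream.
interface StreamingEncoder<E> {
    void encode(E event, OutputStream out) throws IOException;
}

// Minimal illustrative implementation: writes the event's string form
// followed by a newline, with no intermediate byte[] per event.
class LineStreamingEncoder implements StreamingEncoder<String> {
    @Override
    public void encode(String event, OutputStream out) throws IOException {
        out.write(event.getBytes(StandardCharsets.UTF_8));
        out.write('\n');
    }
}
```

An appender supporting both styles can check `encoder instanceof StreamingEncoder` and take the zero-copy path when available, falling back to the legacy byte[]-returning `encode` otherwise.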

@brenuart
Collaborator Author

brenuart commented Jan 5, 2021

Regarding commit 3d68f91, I'm not sure it is actually a good idea to make use of Jackson internal utilities like BufferRecycler, ByteArrayBuilder and SegmentedStringWriter...

Note also that minBufferSize is useless now :-(
Although ByteArrayBuilder has a constructor that accepts both an initial size and a BufferRecycler, it ignores the initial size when creating the first buffer and instead gets one from the BufferRecycler, whose size is hardcoded to 2000 bytes by default. With this in mind, the minimum size of a ByteArrayBuilder is always 2000 bytes, never less - but it may be larger.

New buffers are allocated dynamically as data is written into the ByteArrayBuilder. They are allocated directly on the heap and don't come from the BufferRecycler (see ByteArrayBuilder.java#L294). As the comment in the code says, the size of each newly allocated buffer is half the total size allocated so far (up to a maximum value).

Let's take an example: create a new ByteArrayBuilder and give it a BufferRecycler. A first byte array of 2000 bytes is allocated (cold start). Then we write 3001 bytes - here is what happens:

  • the first 2000 bytes are stored in the initial buffer;
  • a new buffer needs to be allocated for the remaining 1001 bytes. As per the logic described above, the builder allocates a new buffer of half the size of what is currently allocated, i.e. 1000 bytes. The allocated memory is now 2000 + 1000 = 3000 bytes;
  • the next 1000 bytes are written in the tail buffer - 1 byte remains. A new buffer needs to be allocated whose size will be 3000 / 2 = 1500 bytes.
  • the last byte is written in the tail buffer. Total allocated memory is 2000 + 1000 + 1500 = 4500 bytes.
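The growth rule described above can be modelled in a few lines. This is a simplified model of the described behaviour (the 2000-byte first block matches the BufferRecycler default; Jackson's maximum block size is ignored here), not Jackson's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models the growth rule described above: each new block is half of the
// total memory allocated so far.
class GrowthModel {
    static List<Integer> blocksFor(int bytesToWrite, int firstBlockSize) {
        List<Integer> blocks = new ArrayList<>();
        blocks.add(firstBlockSize);
        int allocated = firstBlockSize;
        int remaining = bytesToWrite - firstBlockSize;
        while (remaining > 0) {
            int next = allocated / 2;   // half of everything allocated so far
            blocks.add(next);
            allocated += next;
            remaining -= next;
        }
        return blocks;
    }
}
```

Running the model for the 3001-byte example yields blocks of 2000, 1000 and 1500 bytes, i.e. 4500 bytes total, matching the walkthrough above.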

When done, you need to call ByteArrayBuilder.release() to return memory to the BufferRecycler. The release process returns only the tail buffer to the recycler (see ByteArrayBuilder.java#L96) - meaning that in our example, the array of 1500 bytes is returned to the recycler.

Let's run the example a second time with the same recycler (the whole point being to reuse previous byte arrays and reduce memory allocation). We create a new ByteArrayBuilder that immediately asks the recycler for an initial buffer. Unfortunately, the recycler now holds an array of 1500 bytes, which is LESS than the hardcoded minimum of 2000... So instead of returning the cached array, it discards it and creates a new one of 2000 bytes. The rest happens exactly as during the first - cold - run.

Conclusion: we have not saved/reused anything at all!

I'm puzzled to be honest. I don't see how this mechanism can provide an efficient caching... except when what you write never (or rarely) exceeds the hardcoded minimum (2000 bytes).

My feeling is that we'd better write our own mechanism instead of relying on something that is specifically designed for use by Jackson internals... Maybe I should create an issue on jackson-core and see what Tatu has to say about it... (no doubt he will first tell me I should not use it).

@brenuart
Collaborator Author

brenuart commented Jan 9, 2021

Additional notes:

(1)
I made another, simpler version of this PR (not committed) with a custom ByteArrayOutputStream that grows by allocating additional chunks (to avoid the memory copy that occurs in the JDK version when it grows). Like Jackson's approach, the buffer recycles only the first chunk (whose size is determined by the minBufferSize configuration property) and discards the others. This works like a charm provided minBufferSize is set to an optimal value. The buffer is stored in a ThreadLocal behind a SoftReference so it can be garbage collected should the VM decide to do so...
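The ThreadLocal-plus-SoftReference caching strategy might look roughly like this. This is a sketch under stated assumptions (a plain JDK ByteArrayOutputStream stands in for the custom chunked buffer; names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.lang.ref.SoftReference;

// Sketch of the per-thread buffer caching described above. Each thread
// keeps its buffer behind a SoftReference, so the GC may reclaim it
// under memory pressure; the next acquire() then allocates a fresh one.
class ThreadLocalBufferCache {
    private final int minBufferSize;
    private final ThreadLocal<SoftReference<ByteArrayOutputStream>> cache =
            new ThreadLocal<>();

    ThreadLocalBufferCache(int minBufferSize) {
        this.minBufferSize = minBufferSize;
    }

    ByteArrayOutputStream acquire() {
        SoftReference<ByteArrayOutputStream> ref = cache.get();
        ByteArrayOutputStream buffer = (ref != null) ? ref.get() : null;
        if (buffer == null) {
            buffer = new ByteArrayOutputStream(minBufferSize);
            cache.set(new SoftReference<>(buffer));
        }
        buffer.reset(); // reuse the existing backing array
        return buffer;
    }
}
```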

For information, in production we ingest approximately 15K events/s. Over one hour of a working day, the average JSON size is around 540 bytes, the p95 is 602 bytes, and the largest is several kilobytes (it holds a stack trace). An initial buffer size of 1k or 2k seems to be a good default value.

I quite like this solution. I wonder if we need to go any further... After all, the rest of the chain is not optimised - look at the Layout for instance: it returns a String, and most Logback components are not designed to stream their output...

(2)
Asking the encoder to write directly into the output stream also means that the stream may be "corrupted" if the encoder fails in the middle of processing a log event (it may already have written some bytes by the time it throws an exception). This cannot happen when the Encoder returns its result as a byte array.
I would say that's okay because it only happens in exceptional situations where the encoder has a problem and can't do its job anyway. It can also happen when the JVM is stopped abruptly. With TCP, it can also happen when a connection is lost - the TCP buffers/window may hold only the first bytes of the last JSON event... This means the receiving side (the collector) must be ready to handle "incorrect" data anyway.

@philsttr
Collaborator

philsttr commented Jan 9, 2021

In general, the StreamingEncoder implementation looks great! I'm happy to incorporate it into logstash-logback-encoder.

> The release process returns the tail buffer to the recycler

The fact that it only returned the tail buffer was surprising to me. Thanks for the detailed analysis! Good stuff.

A custom buffer recycler sounds perfectly reasonable, unless it is crazy complex. Want to submit it as an alternate PR for comparison?

@brenuart
Collaborator Author

brenuart commented Jul 5, 2021

Hi Phil... I'm back with this optimisation. A few words about what I did so far...

The ReusableByteBuffer is a kind of fast alternative to the JDK ByteArrayOutputStream, with two notable differences. First of all, it maintains multiple byte[] arrays when it needs to grow, avoiding the data copy the JDK implementation performs. Secondly, instead of discarding all its internal byte arrays on reset, the buffer always keeps the first one. The idea is to size it large enough to hold at least the "average" JSON event and thereby reduce the number of required allocations.
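The chunked-growth idea can be sketched as follows. This mirrors the behaviour described above (grow by appending chunks, keep only the first chunk on reset), not the PR's exact code; the growth policy here is deliberately simplistic:

```java
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch of a chunked buffer: growing adds a new byte[] chunk instead of
// copying existing data; reset() keeps only the first chunk for reuse.
class ChunkedBuffer extends OutputStream {
    private final List<byte[]> chunks = new ArrayList<>();
    private int posInChunk; // write position inside the tail chunk
    private int totalSize;  // bytes written so far

    ChunkedBuffer(int initialSize) {
        chunks.add(new byte[initialSize]);
    }

    @Override
    public void write(int b) {
        byte[] tail = chunks.get(chunks.size() - 1);
        if (posInChunk == tail.length) {
            // tail is full: allocate a new chunk, no copying involved
            tail = new byte[Math.max(tail.length, 64)];
            chunks.add(tail);
            posInChunk = 0;
        }
        tail[posInChunk++] = (byte) b;
        totalSize++;
    }

    /** Copies the content into a single array (only done on demand). */
    byte[] toByteArray() {
        byte[] out = new byte[totalSize];
        int offset = 0;
        for (int i = 0; i < chunks.size(); i++) {
            byte[] c = chunks.get(i);
            int len = (i == chunks.size() - 1) ? posInChunk : c.length;
            System.arraycopy(c, 0, out, offset, len);
            offset += len;
        }
        return out;
    }

    /** Keep the first chunk for reuse, drop the rest. */
    void reset() {
        chunks.subList(1, chunks.size()).clear();
        posInChunk = 0;
        totalSize = 0;
    }
}
```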

ReusableByteBuffers acts as a pool of ReusableByteBuffer instances.
The first implementation used a ThreadLocal to keep one instance per thread. After some thought, it appeared this strategy may lead to higher-than-required memory consumption, with as many buffers as there are threads accessing the Encoder. This may not be a problem with async appenders (only one thread), but it may become serious with other appenders like FileAppender or ConsoleAppender. If you look carefully at their implementation, you will notice that multiple threads can access the Encoder - only the writing into the underlying output stream is guarded.
The new implementation maintains a concurrent linked list of SoftReferences to the allocated buffers. A new buffer is created when the list is empty. Buffers are discarded when the JVM is under high memory pressure and decides to reclaim them.
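A minimal sketch of such a pool, assuming a lock-free queue of SoftReferences (names and the generic factory are illustrative, not the PR's exact code):

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Sketch of the shared buffer pool described above: entries are held via
// SoftReferences so buffers are shared across threads yet remain
// reclaimable by the GC under memory pressure.
class BufferPool<T> {
    private final ConcurrentLinkedQueue<SoftReference<T>> pool =
            new ConcurrentLinkedQueue<>();
    private final Supplier<T> factory;

    BufferPool(Supplier<T> factory) {
        this.factory = factory;
    }

    T acquire() {
        SoftReference<T> ref;
        while ((ref = pool.poll()) != null) {
            T buffer = ref.get();
            if (buffer != null) {
                return buffer;      // reuse a pooled buffer
            }
            // the referent was reclaimed by the GC; try the next entry
        }
        return factory.get();       // pool empty: allocate a new buffer
    }

    void release(T buffer) {
        pool.offer(new SoftReference<>(buffer));
    }
}
```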

The CompositeJsonEncoder can be used as a "standard" Encoder returning a byte[] array, or as a StreamingEncoder rendering directly into an OutputStream passed as argument together with the event to render. For now, only the LogstashTcpSocketAppender supports this method.
The encoder maintains its own pool of buffers, whose initial size is controlled by the existing minBufferSize parameter. These buffers are used by the legacy "encode" methods returning a byte array: when it needs to encode an event, the encoder takes a buffer from the pool and asks the formatter to render the event directly into it. When done, the buffer is returned to the pool and reused for subsequent events.
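Layering the legacy byte[]-returning path on top of the streaming path might look like this. It's a hedged sketch: a plain ByteArrayOutputStream stands in for the pooled reusable buffer, and `writeEvent` stands in for the JSON formatter:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Sketch: the legacy encode() borrows a buffer, lets the streaming path
// render into it, and copies the bytes out only at the very end.
class EncoderSketch {
    // stand-in for the formatter writing JSON straight into a stream
    void writeEvent(String event, OutputStream out) throws IOException {
        out.write(event.getBytes(StandardCharsets.UTF_8));
    }

    byte[] encode(String event) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // pool.acquire() in the PR
        try {
            writeEvent(event, buffer);   // streaming path does the work
            return buffer.toByteArray(); // only the legacy path copies
        } finally {
            buffer.reset();              // pool.release(buffer) in the PR
        }
    }
}
```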

The CompositeJsonLayout follows the same logic. As you noticed, it also uses the same reusable buffers as the CompositeJsonEncoder. Because it needs to return a String instead of bytes, it wraps the buffer in a Writer and asks the JSON formatter to render the event through it. This strategy is similar to what was done before, except it now lives inside the CompositeJsonLayout instead of the CompositeJsonFormatter. This implementation is not as highly optimised as the encoder's... but AFAIK Layouts are somewhat deprecated (correct?).
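The Writer-around-the-buffer idea, sketched with JDK classes (again a stand-in ByteArrayOutputStream plays the role of the reusable buffer; the formatter is reduced to a plain write):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Sketch of the Layout strategy described above: render through a Writer
// wrapping the byte buffer, then materialize the String once at the end.
class LayoutSketch {
    String doLayout(String event) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream(); // reusable buffer in the PR
        Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8);
        writer.write(event); // the JSON formatter renders through the Writer
        writer.flush();      // push buffered chars down into the byte buffer
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```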

I made a small benchmark using JMH to compare this implementation with version 6.6. Here are the results:

  • speed: +23%
  • GC count: -68%
  • GC Time: -64%
  • Allocations: -76%

Not bad... especially with regards to the positive impacts on memory and GC.
But we can do even better! The bottleneck is now in the CompositeJsonFormatter, which creates a new Jackson JsonGenerator for every event. I made some additional experiments reusing the same JsonGenerator as long as the output stream does not change. This gives the following improvements compared to 6.6:

  • speed: +46%
  • GC count: -87%
  • GC Time: -86%
  • Allocations: -87%

This of course requires additional changes. I propose to cover it in a separate issue. The improvements provided by this PR already seem to be good enough ;-) Tell me what you think...

Collaborator

@philsttr philsttr left a comment


This looks really good!

The performance numbers you posted are impressive.

My comments are mostly minor. Overall it's a really great implementation.

@philsttr philsttr added this to the 7.0 milestone Jul 5, 2021
@philsttr philsttr merged commit fa81678 into logfellow:main Jul 5, 2021
@philsttr
Collaborator

philsttr commented Jul 5, 2021

Thanks again for your work on this! Very cool improvement.

(And thanks for addressing all my code review comments. I know I can be picky at times. ;) Cheers!

@brenuart
Collaborator Author

brenuart commented Jul 5, 2021

👍
I noticed however that my PR changed the order of import statements :-( Should import order be part of the checkstyle rules as well?

@brenuart brenuart deleted the gh461 branch July 5, 2021 23:05
@philsttr
Collaborator

philsttr commented Jul 5, 2021

I think the import order was all over the place when I originally enabled the checkstyle rules. I tried to create rules that matched the existing codebase as much as possible to minimize reformatting, and to "stop the bleeding".

Having said that, I'm not against defining import order rules and reorganizing the existing imports. If there is a clear majority practice, I'd like to stick with it. But I'm less concerned about changing imports than reformatting code in general because I rarely use git blame to look at imports. I'm usually looking at actual code when using git blame, and therefore am much more hesitant to reformat the actual code.

Successfully merging this pull request may close these issues.

Reduce GC pressure in CompositeJsonEncoder and *LogstashTcpSocketAppender