
8229517: Support for optional asynchronous/buffered logging #3135

Open · wants to merge 24 commits into base: master

Conversation

@navyxliu (Contributor) commented Mar 22, 2021

This patch provides a buffer to store asynchronous messages and flush them to
underlying files periodically.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Integration blocker

 ⚠️ The change requires a CSR request to be approved.

Issue

  • JDK-8229517: Support for optional asynchronous/buffered logging

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/3135/head:pull/3135
$ git checkout pull/3135

Update a local copy of the PR:
$ git checkout pull/3135
$ git pull https://git.openjdk.java.net/jdk pull/3135/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 3135

View PR using the GUI difftool:
$ git pr show -t 3135

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/3135.diff

Xin Liu added 4 commits Mar 22, 2021
Add an option async for the file-based outputs. The option conforms to
output-option in JEP-158.
fix a warning from c++11: "The generation of the implicitly-defined copy
assignment operator is deprecated if T has a user-declared destructor or
user-declared copy constructor."
correct assertion.
@bridgekeeper (bot) commented Mar 22, 2021

👋 Welcome back xliu! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr label Mar 22, 2021
@openjdk (bot) commented Mar 22, 2021

@navyxliu The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot label Mar 22, 2021
@dholmes-ora (Member) commented Mar 23, 2021

/csr needed

@openjdk openjdk bot added the csr label Mar 23, 2021
@openjdk (bot) commented Mar 23, 2021

@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@navyxliu please create a CSR request and add link to it in JDK-8229517. This pull request cannot be integrated until the CSR request is approved.

@navyxliu (Contributor, Author) commented Mar 23, 2021

Hi, Reviewers,

I would like to restart the RFR process for the async logging feature. We (AWS) have deployed this feature for over a year in a few critical services. It helps us reduce long-tail GC pauses. On Linux, we used to experience intermittent delays on the order of seconds due to gc log writes. If those undesirable events happen to occur at safepoints, hotspot has to prolong the pause intervals, which then increases the response time of the Java application/service.

Originally, we observed and solved this issue on a Linux system with software RAID. In the absence of hardware assistance, multiple writes have to be synchronized, and it is that operation that yields long pauses. This issue may become more prevalent if Linux servers adopt ZFS in the future. We don’t think redirecting log files to tmpfs is a final solution. Hotspot should provide a self-contained and cross-platform solution. Ours is to provide a buffer and flush it in a standalone thread periodically.

Since then, we have found more unexpected but interesting scenarios, e.g. some cloud-based applications run entirely on an AWS EBS partition. The write syscall can be a blocking operation if the underlying infrastructure is experiencing an intermittent issue. Even stdout/stderr are not absolutely non-blocking. Someone may send an XOFF (software flow control) and pause the tty from reading. As a result, the JVM process emitting logs to the tty is then blocked. Yes, that action may accidentally freeze the Java service!

Those pain points are not AWS-exclusive. We found relevant questions on Stack Overflow[1], and it seems that J9 provides an option -Xgc:bufferedLogging to mitigate it[2]. We hope hotspot will consider our feature.

Back to the implementation: this is the 2nd revision, based on Unified Logging. The previous RFR[3] was a top-down design: we provided a parallel header file asynclog.hpp and hoped log sites would opt in. That design is very rigid because asynclog.hpp is full of template parameters, and it was turned down[4]. The new patch deprecates the old design and achieves asynchronous logging in a bottom-up way. We provide an output-option which conforms to JEP 158[5]. Developers can choose asynchronous mode for a file-based output by providing an extra option async=true, e.g. -Xlog:gc=debug:file=gc.log::async=true

May we know more about LogMessageBuffer.hpp/cpp? We haven’t found a real use of it. That’s why we are hesitant to support LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). Further, we haven’t supported async_mode for LogStdoutOutput and LogStderrOutput either. It’s not difficult but needs a big code change.

[1] https://stackoverflow.com/questions/27072340/is-gc-log-writing-asynchronous-safe-to-put-gc-log-on-nfs-mount
[2] https://stackoverflow.com/questions/54994943/is-openj9-gc-log-asynchronous
[3] https://cr.openjdk.java.net/~xliu/8229517/01/webrev/
[4] https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-March/041034.html
[5] https://openjdk.java.net/jeps/158

@YaSuenag (Member) left a comment

I think this PR is very useful for us!

May we know more about LogMessageBuffer.hpp/cpp? We haven’t found a real use of it. That’s why we are hesitant to support LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). Further, we haven’t supported async_mode for LogStdoutOutput and LogStderrOutput either. It’s not difficult but needs a big code change.

LogMessageBuffer is used in LogMessage. For example, we can see it as follows. Frame #1 is LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). IMHO we do not need to change LogStdout/errOutput, but it is better to change LogMessageBuffer.

#0  LogFileStreamOutput::write (this=this@entry=0x7ffff0002af0, msg_iterator=...)
    at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logDecorators.hpp:108
#1  0x00007ffff6e80e8e in LogFileOutput::write (this=this@entry=0x7ffff0002af0, msg_iterator=...)
    at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logFileOutput.cpp:314
#2  0x00007ffff6e876eb in LogTagSet::log (
    this=this@entry=0x7ffff7d4a640 <LogTagSetMapping<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>, msg=...) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logTagSet.cpp:85
#3  0x00007ffff6a194df in LogImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write (
    msg=...) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logTagSet.hpp:150
#4  LogMessageImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::flush (
    this=0x7ffff58675d0) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logMessage.hpp:79
#5  LogMessageImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::~LogMessageImpl (
    this=0x7ffff58675d0, __in_chrg=<optimized out>) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logMessage.hpp:74
#6  InstanceKlass::print_class_load_logging (this=this@entry=0x800007430, loader_data=loader_data@entry=0x7ffff00f5200,
    module_entry=module_entry@entry=0x0, cfs=cfs@entry=0x0) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/oops/instanceKlass.cpp:3647
Review threads on src/hotspot/share/runtime/globals.hpp (outdated, resolved)
@navyxliu (Contributor, Author) commented Mar 25, 2021

I think this PR is very useful for us!

May we know more about LogMessageBuffer.hpp/cpp? We haven’t found a real use of it. That’s why we are hesitant to support LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). Further, we haven’t supported async_mode for LogStdoutOutput and LogStderrOutput either. It’s not difficult but needs a big code change.

LogMessageBuffer is used in LogMessage. For example, we can see it as follows. Frame #1 is LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). IMHO we do not need to change LogStdout/errOutput, but it is better to change LogMessageBuffer.

#0  LogFileStreamOutput::write (this=this@entry=0x7ffff0002af0, msg_iterator=...)
    at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logDecorators.hpp:108
#1  0x00007ffff6e80e8e in LogFileOutput::write (this=this@entry=0x7ffff0002af0, msg_iterator=...)
    at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logFileOutput.cpp:314
#2  0x00007ffff6e876eb in LogTagSet::log (
    this=this@entry=0x7ffff7d4a640 <LogTagSetMapping<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::_tagset>, msg=...) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logTagSet.cpp:85
#3  0x00007ffff6a194df in LogImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::write (
    msg=...) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logTagSet.hpp:150
#4  LogMessageImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::flush (
    this=0x7ffff58675d0) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logMessage.hpp:79
#5  LogMessageImpl<(LogTag::type)16, (LogTag::type)68, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0, (LogTag::type)0>::~LogMessageImpl (
    this=0x7ffff58675d0, __in_chrg=<optimized out>) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/logging/logMessage.hpp:74
#6  InstanceKlass::print_class_load_logging (this=this@entry=0x800007430, loader_data=loader_data@entry=0x7ffff00f5200,
    module_entry=module_entry@entry=0x0, cfs=cfs@entry=0x0) at /home/ysuenaga/github-forked/jdk/src/hotspot/share/oops/instanceKlass.cpp:3647

hi, @YaSuenag,

Thank you for providing the stacktrace! I didn't notice <logMessage.hpp> until you pointed it out. Now I understand the rationale and use cases of LogMessageBuffer. Let me figure out how to support it.

IIUC, the most important attribute of LogMessage is to guarantee messages are consecutive, or free from interleaving. I will focus on it.

@tstuefe (Member) commented Mar 25, 2021

Hi Xin,

I skimmed over the patch, but have a number of high level questions - things which have not been clear from your description.

  • Who does the writing, and who is affected when the writing stalls?
  • Do you then block or throw output away?
    • If the former, how do you mitigate the ripple effect?
    • If the latter, how does the reader of the log file know that something is missing?
  • How often do you flush? How do you prevent missing output in the log file in case of crashes?
  • Can this really handle the full brunt of logging (-Xlog:*=trace) over many threads?
  • Does this work with multiple target and multiple IO files?
  • Does it cost anything if logging is off or not async?

Update: Okay, I see you use PeriodicTask and the WatcherThread. Is this really enough? I would be concerned that it either runs too rarely to be able to swallow all output or that it runs so often that it monopolizes the WatcherThread.

I actually expected a separate Thread - or multiple, one per output - for this, waking up when there is something to write. That would also be more efficient than constant periodic polling.

  • How is the performance impact when we have lots of concurrent writes from many threads? I see that you use a Mutex to synchronize the logging threads with the flush service. Before, these threads would have done concurrent IO and that would be handled by the libc, potentially without locking.

I think this feature could be useful. I am a bit concerned with the increased complexity this brings. UL is already a very (I think unnecessarily) complex codebase. Maybe we should try to reduce its complexity first before adding new features to it. This is just my opinion, lets see what others think.

Cheers, Thomas

@tstuefe (Member) commented Mar 25, 2021

p.s. I like the integration into UL via a target modification btw. That feels very organic.

@mlbridge (bot) commented Mar 26, 2021

Mailing list message from Thomas Stüfe on hotspot-dev:

Can you please link the CSR to the issue?

On Fri, Mar 26, 2021 at 8:21 AM Yasumasa Suenaga <ysuenaga at openjdk.java.net>
wrote:

@robehn (Contributor) commented Mar 26, 2021

Hi,

This is flushed by the watcher thread (a non-JavaThread).
Flushing can thus happen during a safepoint, and one or more safepoints may have passed between the actual logging and the flushing.
If the VM thread logs, it seems it can be delayed while the watcher thread does pop_all().
I suppose pop_all can take a while if you have a couple of thousand log messages?

We can also change log-configuration during run-time, e.g. turn on/off logs via jcmd.
Wouldn't it be more natural to flush the async logs-lines before we update the log configuration? (e.g. if you turn off a log via jcmd, we flush the async buffer before)

Thanks

@simonis (Member) left a comment

Hi Xin,

thanks for finally addressing this issue. In general your change looks good. Please find my detailed comments inline.

Best regards,
Volker

Review threads on src/hotspot/share/runtime/globals.hpp and src/hotspot/share/logging/logAsyncFlusher.hpp (outdated, resolved)
return _tail == NULL ? NULL : _tail->peek();
}

void log_drop(E* e) {}

@simonis (Member) commented Mar 26, 2021

See comment on pop_front(). I'd remove this function entirely and handle it right in LogAsyncFlusher::enqueue().

Review threads on src/hotspot/share/logging/logAsyncFlusher.cpp, logAsyncFlusher.hpp, logFileOutput.cpp and test/hotspot/gtest/logging/test_asynclog.cpp (outdated, resolved)
@navyxliu (Contributor, Author) commented Mar 26, 2021

hi, Thomas,

Thank you for reviewing this PR.

Hi Xin,

I skimmed over the patch, but have a number of high level questions - things which have not been clear from your description.

  • Who does the writing, and who is affected when the writing stalls?

The WatcherThread eventually flushes those buffered messages. If the writing stalls, it blocks periodic tasks.
If it blocks long enough, other periodic tasks are skipped.

  • Do you then block or throw output away?

    • If the former, how do you mitigate the ripple effect?
    • If the latter, how does the reader of the log file know that something is missing?

The capacity of the buffer is limited by AsyncLogBufferSize (2k entries by default).
Actually, logTagSet.cpp limits the maximal length of a vwrite to 512, which means the maximal memory used by this buffer is 1M (= 2k * 0.5k).

If the buffer overflows, it starts dropping from the head; this behavior simulates a ring buffer.
If you enable -XX:+Verbose, the dropped messages are printed to the tty console.

I prefer dropping messages to letting the buffer grow, because the latter may trigger an out-of-memory error.
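
For illustration only, a minimal stand-alone sketch of this drop-oldest behavior; std::deque and every name below are invented for the sketch and are not the PR's actual data structures:

#include <cstddef>
#include <deque>
#include <string>

// Bounded FIFO that drops the oldest entry when full, approximating the
// "simulated ring buffer" behavior described above.
class DroppingBuffer {
  std::deque<std::string> _q;
  const std::size_t _capacity;
  std::size_t _dropped = 0;

 public:
  explicit DroppingBuffer(std::size_t capacity) : _capacity(capacity) {}

  void enqueue(std::string msg) {
    if (_q.size() >= _capacity) {
      _q.pop_front();     // drop the head, i.e. the oldest message
      _dropped++;         // the PR reports such drops with -XX:+Verbose
    }
    _q.push_back(std::move(msg));
  }

  std::size_t dropped() const { return _dropped; }
};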

  • How often do you flush? How do you prevent missing output in the log file in case of crashes?

The interval is defined by LogAsyncInterval (300ms by default). I insert a statement async->flusher() in ostream_abort().

  • Can this really handle the full brunt of logging (-Xlog:*=trace) over many threads?
    To be honest, it can't. I see a lot of dropped messages on the console with -XX:+Verbose.

I have tuned parameters that it won't drop messages easily for normal GC activity with info verbosity.
-Xlog:*=trace will drop messages indeed, but this is tunable. I have a stress test to tweak parameters.

  • Does this work with multiple target and multiple IO files?

Yes, it works if you have multiple outputs. LogAsyncFlusher is a singleton: one single buffer and one thread serve them all.

  • Does it cost anything if logging is off or not async?

So far, LogAsyncFlusher as a periodic task remains active even if no output is in async_mode.
It wakes up every LogAsyncInterval ms. It's a dummy task because the deque is always empty; the cost is almost nothing.

Update: Okay, I see you use PeriodicTask and the WatcherThread. Is this really enough? I would be concerned that it either runs too rarely to be able to swallow all output or that it runs so often that it monopolizes the WatcherThread.

I actually expected a separate Thread - or multiple, one per output - for this, waking up when there is something to write. That would also be more efficient than constant periodic polling.

Your concern is reasonable. I don't understand why there is only one WatcherThread and up to 10 periodic tasks are crowded into it.
If it's a bottleneck, I plan to improve this infrastructure. I can make hotspot support multiple watcher threads and spread periodic tasks among them. All watcher threads would be connected in a linked list for management.

Can we treat it as a separate task? For normal usage, I think the delay is quite manageable. Writing thousands of lines to a file can usually be done in sub-ms.

  • How is the performance impact when we have lots of concurrent writes from many threads? I see that you use a Mutex to synchronize the logging threads with the flush service. Before, these threads would have done concurrent IO and that would be handled by the libc, potentially without locking.

IMHO, logging shouldn't hurt performance a lot. At least, the logs that do impact performance are not supposed to be enabled by default. On the other hand, I hope log messages from different threads avoid interleaving when I enable them, so I can read them.
That leads me to use a mutex, which actually improves readability.

My design target is non-blocking. pop_all() is an ad-hoc operation which pops all elements and releases the mutex immediately. writeback() does IO without it.

In our real applications, we haven't seen this feature downgrade GC performance yet.

I think this feature could be useful. I am a bit concerned with the increased complexity this brings. UL is already a very (I think unnecessarily) complex codebase. Maybe we should try to reduce its complexity first before adding new features to it. This is just my opinion, lets see what others think.

Cheers, Thomas

I believe UL has its own reasons. In my defense, I don't make UL more complex. I only changed a couple of lines in one of its implementation files (logFileOutput.cpp) and didn't change its interfaces.

I try my best to reuse the existing codebase. We can always refactor existing code (JDK-8239066, JDK-8263840), but that's not this PR's purpose.

thanks,
--lx

@navyxliu (Contributor, Author) commented Mar 27, 2021

Mailing list message from Thomas Stüfe on hotspot-dev:

Can you please link the CSR to the issue?

On Fri, Mar 26, 2021 at 8:21 AM Yasumasa Suenaga
wrote:

https://bugs.openjdk.java.net/browse/JDK-8264323

@tstuefe (Member) left a comment

Hi Xin,

thank you for your detailed answers.

As I wrote, I think this is a useful change. A prior design discussion with a rough sketch would have made things easier. Also, it would have been good to have the CSR discussion beforehand, since it affects how complex the implementation needs to be. I don't know whether there had been design discussions beforehand; if I missed them, I apologize.

I am keenly aware that design discussions often lead nowhere because no-one answers. So I understand why you started with a patch.

About your proposal:

I do not think it can be made airtight, and I think that is okay - if we work with a limited flush buffer and we log too much, things will get dropped, that is unavoidable. But it has to be reliable and comprehensible after the fact.

As you write, the patch you propose works well with AWS, but I suspect that is an environment with limited variables, and outside use of the VM could be much more diverse. We must make sure to roll out only well designed solutions which work for us all.

E.g. a log system which randomly omits log entries because some internal buffer is full without giving any indication in the log itself is a terrible idea :). Since log files are a cornerstone for our support, I am interested in a good solution.

First off, the CSR:


  1. With what you propose, we could have an arbitrary combination of targets with different log files and different async options:
java -Xlog:os*:file=os.log::async=false -Xlog:os+thread:file=thread.log::async=true

Do we really need that much freedom? How probable is it that someone wants different async options for different trace sinks? The more freedom we have here, the more complex the implementation gets. All that stuff has to be tested. Why not just make "async" a global setting?

  2. AsyncLogBufferSize should be a user-definable memory size, not "number of slots". The fact that internally we keep a vector of disjunct memory snippets is an implementation detail; the user should just give a memory size and the VM should interpret it. This leaves us the freedom to later change the implementation as we see fit.

  3. LogAsyncInterval should not exist at all. If it has to exist, it should be a diagnostic switch, not a production one; but ideally, we would just log as soon as there is something to log, see below.


Implementation:

The use of the WatcherThread and PeriodicTask. Polling is plain inefficient, besides the concerns Robbin voiced about blocking things. This is a typical producer-consumer problem, and I would implement it using a dedicated flusher thread and a monitor. The flusher thread should wake only if there is something to write. This is something I would not do in a separate RFE but now. It would also disarm any arguments against blocking the WatcherThread.


The way every log message gets strduped could be done better. This can be left for a future RFE - but it explains why I dislike "AsyncLogBufferSize" being a "number of entries" instead of a memory size.

I think processing a memory-size AsyncLogBufferSize can be kept simple: it would be okay to just guess an average log line length and go with that. Let's say 256 chars. An AsyncLogBufferSize=1M would thus translate to 4096 entries in your solution. If the sum of all 4096 allocated lines overshoots 1M from time to time, well, so be it.
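
A hedged sketch of that heuristic, just to pin down the arithmetic (the function name and the 256-byte average are taken from the paragraph above, nothing else is implied):

#include <cstddef>

// Translate a byte-sized AsyncLogBufferSize into a slot count, assuming an
// average log line of 256 characters, e.g. 1 MB -> 4096 slots.
static std::size_t slots_for(std::size_t async_log_buffer_size_bytes) {
  const std::size_t assumed_avg_line = 256;
  return async_log_buffer_size_bytes / assumed_avg_line;
}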

A future, better solution could use a preallocated fixed-size buffer. There are two ways to do this: the naive but memory-inefficient way, an array of fixed-size text slots like the event system does; and a smart way, a ring buffer of variable-sized strings, '\0'-separated, laid out one after the other in memory. The latter is a bit more involved, but can be done, and it would be fast and very memory efficient. But as I wrote, this is an optimization which can be postponed.
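
For illustration only, a single-threaded sketch of such a '\0'-separated ring buffer (synchronization is deliberately omitted; the capacity and all names are invented for this sketch):

#include <cstddef>
#include <cstring>

// Fixed-size circular byte buffer storing variable-length, '\0'-terminated
// strings back to back; the oldest entries are dropped to make room.
class StringRingBuffer {
  static const std::size_t CAP = 4096;
  char        _buf[CAP];
  std::size_t _head = 0;   // index of the oldest stored byte
  std::size_t _size = 0;   // number of bytes currently stored

  std::size_t wrap(std::size_t i) const { return i % CAP; }

 public:
  // Append msg (including its terminating '\0'); drop oldest entries to fit.
  bool push(const char* msg) {
    std::size_t need = std::strlen(msg) + 1;
    if (need > CAP) return false;                // can never fit
    while (CAP - _size < need) pop(nullptr, 0);  // make room by dropping heads
    std::size_t pos = wrap(_head + _size);
    for (std::size_t i = 0; i < need; i++) {
      _buf[wrap(pos + i)] = msg[i];
    }
    _size += need;
    return true;
  }

  // Copy the oldest entry into out (if given) and remove it from the buffer.
  bool pop(char* out, std::size_t out_len) {
    if (_size == 0) return false;
    std::size_t i = 0;
    char c;
    do {
      c = _buf[wrap(_head + i)];
      if (out != nullptr && i < out_len) out[i] = c;
      i++;
    } while (c != '\0');
    if (out != nullptr && out_len > 0) out[out_len - 1] = '\0';
    _head = wrap(_head + i);
    _size -= i;
    return true;
  }
};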


I may misunderstand the patch, but do you resolve decorators when the flusher is printing? Would this not distort time-dependent decorators (timemillis, timenanos, uptime etc.), since we would record the time of printing, not the time of logging?

If yes, it may be better to resolve the message early and just store the plain string and print that. Basically this would mean moving the whole buffering down a layer or two, right to where the raw strings get printed. This would be vastly simplified if we abandon the "async for every trace sink" notion in favor of just a global flag.

This would also save a bit of space, since we would not have to carry all the meta information in AsyncLogMessage around. I count at least three 64bit slots, possibly 4-5, which alone makes for ~40 bytes per message. Resolved decorators are often smaller than that.

Please find further remarks inline.

hi, Thomas,

Thank you for reviewing this PR.

Hi Xin,
I skimmed over the patch, but have a number of high level questions - things which have not been clear from your description.

  • Who does the writing, and who is affected when the writing stalls?

The WatcherThread eventually flushes those buffered messages. If the writing stalls, it blocks periodic tasks.
If it blocks long enough, other periodic tasks are skipped.

  • Do you then block or throw output away?

    • If the former, how do you mitigate the ripple effect?
    • If the latter, how does the reader of the log file know that something is missing?

The capacity of the buffer is limited by AsyncLogBufferSize (2k entries by default).
Actually, logTagSet.cpp limits the maximal length of a vwrite to 512, which means the maximal memory used by this buffer is 1M (= 2k * 0.5k).

If the buffer overflows, it starts dropping from the head; this behavior simulates a ring buffer.
If you enable -XX:+Verbose, the dropped messages are printed to the tty console.

I prefer dropping messages to letting the buffer grow, because the latter may trigger an out-of-memory error.

  • How often do you flush? How do you prevent missing output in the log file in case of crashes?

The interval is defined by LogAsyncInterval (300ms by default). I insert a statement async->flusher() in ostream_abort().

If the flusher blocks, this could block VM shutdown? Would this be different from what we do now, e.g. since all log output is serialized and done by one thread? It's probably fine, but we should think about this.

  • Can this really handle the full brunt of logging (-Xlog:*=trace) over many threads?
    To be honest, it can't. I see a lot of dropped messages on the console with -XX:+Verbose.

I have tuned parameters that it won't drop messages easily for normal GC activity with info verbosity.
-Xlog:*=trace will drop messages indeed, but this is tunable. I have a stress test to tweak parameters.

  • Does this work with multiple target and multiple IO files?

Yes, it works if you have multiple outputs. LogAsyncFlusher is a singleton: one single buffer and one thread serve them all.

The question was how we handle multiple trace sinks, see my "CSR" remarks.

  • Does it cost anything if logging is off or not async?

So far, LogAsyncFlusher as a periodic task remains active even if no output is in async_mode.
It wakes up every LogAsyncInterval ms. It's a dummy task because the deque is always empty; the cost is almost nothing.

Update: Okay, I see you use PeriodicTask and the WatcherThread. Is this really enough? I would be concerned that it either runs too rarely to be able to swallow all output or that it runs so often that it monopolizes the WatcherThread.
I actually expected a separate Thread - or multiple, one per output - for this, waking up when there is something to write. That would also be more efficient than constant periodic polling.

Your concern is reasonable. I don't understand why there is only one WatcherThread and up to 10 periodic tasks are crowded into it.
If it's a bottleneck, I plan to improve this infrastructure. I can make hotspot support multiple watcher threads and spread periodic tasks among them. All watcher threads would be connected in a linked list for management.

Can we treat it as a separate task? For normal usage, I think the delay is quite manageable. Writing thousands of lines to a file can usually be done in sub-ms.

  • How is the performance impact when we have lots of concurrent writes from many threads? I see that you use a Mutex to synchronize the logging threads with the flush service. Before, these threads would have done concurrent IO and that would be handled by the libc, potentially without locking.

IMHO, logging shouldn't hurt performance a lot. At least, the logs that do impact performance are not supposed to be enabled by default. On the other hand, I hope log messages from different threads avoid interleaving when I enable them, so I can read them.
That leads me to use a mutex, which actually improves readability.

My design target is non-blocking. pop_all() is an ad-hoc operation which pops all elements and releases the mutex immediately. writeback() does IO without it.

Since you use a mutex it introduces synchronization, however short, across all logging threads. So it influences runtime behavior. For the record, I think this is okay; maybe a future RFE could improve this with a lockless algorithm. I just wanted to know if you measured anything, and I was curious whether there is a difference now between synchronous and asynchronous logging.

(Funnily, asynchronous logging is really more synchronous in a sense, since it synchronizes all logging threads across a common resource).

In our real applications, we haven't seen this feature downgrade GC performance yet.

I think this feature could be useful. I am a bit concerned with the increased complexity this brings. UL is already a very (I think unnecessarily) complex codebase. Maybe we should try to reduce its complexity first before adding new features to it. This is just my opinion, lets see what others think.
Cheers, Thomas

I believe UL has its own reasons. In my defense, I don't make UL more complex. I only changed a couple of lines in one of its implementation files (logFileOutput.cpp) and didn't change its interfaces.

I try my best to reuse the existing codebase. We can always refactor existing code (JDK-8239066, JDK-8263840), but that's not this PR's purpose.

I understand. It's fine to do this in a later RFE.

thanks,
--lx

Cheers, Thomas

if (_message != NULL) {
// should cache this object somehow
LogDecorations decorations(_level, _tagset, _decorators);
_output.write_blocking(decorations, _message);

@tstuefe (Member) commented Mar 27, 2021

Would this mean that time-dependent decorators get resolved at print time, not when the log happens?

@navyxliu (Author, Contributor) commented Mar 28, 2021

Yes, it is. I just realized that this does distort the timestamps of the relevant decorators.

I explained to Volker why I chose to compress LogDecorations to LogDecorators. #3135 (comment)

This issue is similar to the "safepoint bias" in Java profilers. Those timestamps are "biased" toward the flusher, but the error is limited to LogAsyncInterval. We could spend extra effort to keep the original timestamps. Is it worth it?

Xin Liu added 2 commits Mar 28, 2021
LogMessage supports async_mode.
remove the option AsyncLogging
rename the option GCLogBufferSize to AsyncLogBufferSize
move drop_log() to LogAsyncFlusher.
add a constraint for the option LogAsyncInterval.
@mlbridge (bot) commented Mar 29, 2021

Mailing list message from David Holmes on hotspot-dev:

Hi Xin,

On 27/03/2021 5:30 pm, Thomas Stuefe wrote:

On Mon, 22 Mar 2021 22:12:14 GMT, Xin Liu <xliu at openjdk.org> wrote:

This patch provides a buffer to store asynchronous messages and flush them to
underlying files periodically.

IMO the discussions last year left it still an open question whether
this was a general purpose facility that we really needed/wanted in
hotspot. I think there is still a larger design discussion to be had for
a general purpose facility versus the in-house version you have been
using that suited your needs.

I'm piggy-backing on some of Thomas's comments below.

Hi Xin,

thank you for your detailed answers.

As I wrote, I think this is a useful change. A prior design discussion with a rough sketch would have made things easier. Also, it would have been good to have the CSR discussion beforehand, since it affects how complex the implementation needs to be. I don't know whether there had been design discussions beforehand; if I missed them, I apologize.

I am keenly aware that design discussions often lead nowhere because no-one answers. So I understand why you started with a patch.

About your proposal:

I do not think it can be made airtight, and I think that is okay - if we work with a limited flush buffer and we log too much, things will get dropped, that is unavoidable. But it has to be reliable and comprehensible after the fact.

As you write, the patch you propose works well with AWS, but I suspect that is an environment with limited variables, and outside use of the VM could be much more diverse. We must make sure to roll out only well designed solutions which work for us all.

E.g. a log system which randomly omits log entries because some internal buffer is full without giving any indication *in the log itself* is a terrible idea :). Since log files are a cornerstone for our support, I am interested in a good solution.

First off, the CSR:

---
1) With what you propose, we could have an arbitrary combination of targets with different log files and different async options:
java -Xlog:os*:file=os.log::async=false -Xlog:os+thread:file=thread.log::async=true

Do we really need that much freedom? How probable is it that someone wants different async options for different trace sinks? The more freedom we have here, the more complex the implementation gets. All that stuff has to be tested. Why not just make "async" a global setting?

Truly global or global for all actual file-based logging? I think
perhaps the latter. If we need per log file control that could be added
later if needed.

2) AsyncLogBufferSize should be a user-definable memory size, not "number of slots". The fact that internally we keep a vector of disjunct memory snippets is an implementation detail; the user should just give a memory size and the VM should interpret it. This leaves us the freedom to later change the implementation as we see fit.

I'm not sure it should be a bounded size at all. I don't like the idea
of needing yet another VM flag to control this. Why can't the design
accommodate unlimited buffer space as determined by available memory?

3) LogAsyncInterval should not exist at all. If it has to exist, it should be a diagnostic switch, not a production one; but ideally, we would just log as soon as there is something to log, see below.

The logging interval should be configurable IMO, so it either needs a
product switch, or preferably is set as a global logging option via the
Xlog command if that is possible.

---

Implementation:

The use of the WatcherThread and PeriodicTask. Polling is plain inefficient, beside the concerns Robbin voiced about blocking things. This is a typical producer-consumer problem, and I would implement it using an own dedicated flusher thread and a monitor. The flusher thread should wake only if there is something to write. This is something I would not do in a separate RFE but now. It would also disarm any arguments against blocking the WatcherThread.

I agree with Thomas here. Using the WatcherThread for this is not really
appropriate, though I understand it is convenient to use a thread that
is already present and, crucially, one that does not block for
safepoints. And we don't need multiple WatcherThreads or a way to
"spread the load" as the Watcher thread is very lightly loaded. So a new
kind of NonJavaThread should be introduced for this purpose - if it is
truly required - and directly synchronize with it rather than using
polling. Further, we must ensure that the flushing thread cannot block
the VMThread, if the VMThread is doing logging.

----

The fact that every log message gets strduped could be done better. This can be left for a future RFE - but it explains why I dislike "AsyncLogBufferSize" being "number of entries" instead of a memory size.

If we had had async logging from day one then the way we construct log
messages would have been done differently. With all the logging sites
assuming synchronous logging, with stack-local allocations/resources, we
have no choice but to copy messages to memory with a suitable lifetime.

As I started with, I think there needs to be a return to the high level
architectural design discussion for this feature. The PR serves as a
proof-of-concept, but should not IMO be presented as the right general
solution for all hotspot users.

Regards,
David
-----

@tstuefe (Member) commented Mar 29, 2021

Note that I am really in favor of bringing async logging to UL; this issue popped up again and again, brought up in various forms by various people. It will be good to finally tackle this.

But I agree that talking about the design first would be helpful. Maybe have a little mailing list thread to stop polluting this PR?

@YaSuenag (Member) commented Mar 29, 2021

But I agree that talking about the design first would be helpful. Maybe have a little mailing list thread to stop polluting this PR?

I posted a similar discussion to hotspot-runtime-dev last November. It aims to implement sending UL output via a network socket. I believe this PR helps it.

https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-November/043427.html

@navyxliu (Contributor, Author) commented Mar 30, 2021

But I agree that talking about the design first would be helpful. Maybe have a little mailing list thread to stop polluting this PR?

I posted a similar discussion to hotspot-runtime-dev last November. It aims to implement sending UL output via a network socket. I believe this PR helps it.

https://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2020-November/043427.html

Interesting. That design diagram is similar to this PR, but I don't think it is a good idea to have a blocking message buffer.
As mentioned in the prior thread, it makes hotspot more subject to external factors. TCP/UDP is an even more representative example of blocking IO than a hard drive, isn't it?

Design and its Rationale

For the async-logging feature, we propose a lossy, non-blocking design here. A bounded deque or ring buffer gives a strong guarantee that log sites won't block java threads or the critical internal threads. This is the very problem we are meant to solve.

It can be proven that we cannot have all three guarantees at the same time: non-blocking, bounded memory and log fidelity. To overcome blocking I/O, which sometimes is not under our control, we think it's fair to trade log fidelity for being non-blocking. If we kept fidelity and chose an unbounded buffer, we could end up with spooky out-of-memory errors on resource-constrained hardware. We understand that the platforms hotspot runs on range from powerful servers to embedded devices. By leaving the buffer size adjustable, we can fit more scenarios. Nevertheless, with a bounded buffer, we believe developers can still capture the important logging traits as long as the window is big enough and log messages are consecutive. The current implementation does provide those two merits.

A new proposal based on current implementation

I agree with the reviewers' comments above. It's questionable to use the singleton WatcherThread to do an IO-intensive job here; it may hinder other tasks. David's guess is right. I was not familiar with hotspot threads and was quite frustrated dealing with a special-task thread's lifecycle. That's why I used PeriodicTask. Now I feel more confident to take on that challenge again.

Just like Yasumasa depicted, I can create a dedicated NonJavaThread to flush logs instead. Yesterday, I found that WatcherThread::unpark() uses its monitor to wake up pending tasks. I think we can implement it in this way: once log sites observe the buffer is half-full, they use monitor::notify() to resume the flusher thread. I think logging events are high-frequency but not very contentious. Waking the flusher up for each log message is not so economical. I have a lossy buffer anyway, so I suggest having only two checkpoints: 1) half-full; 2) full.
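
A rough sketch of that wake-up policy using standard C++ primitives (HotSpot would use its own Monitor and a NonJavaThread instead; the class, names and thresholds here are purely illustrative):

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

class AsyncBuffer {
  std::deque<std::string>  _q;
  std::mutex               _lock;
  std::condition_variable  _cv;
  const std::size_t        _capacity;

 public:
  explicit AsyncBuffer(std::size_t capacity) : _capacity(capacity) {}

  // Log site: lossy enqueue; wake the flusher only at the half-full mark.
  void enqueue(std::string msg) {
    std::unique_lock<std::mutex> ml(_lock);
    if (_q.size() >= _capacity) _q.pop_front();        // drop the oldest
    _q.push_back(std::move(msg));
    if (_q.size() >= _capacity / 2) _cv.notify_one();  // half-full checkpoint
  }

  // Flusher thread: sleep until woken, then drain everything in one go.
  std::deque<std::string> pop_all() {
    std::unique_lock<std::mutex> ml(_lock);
    _cv.wait(ml, [this] { return !_q.empty(); });
    std::deque<std::string> out;
    out.swap(_q);
    return out;       // file I/O happens after the lock is released
  }
};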

Wrap it up

We would like to propose a lossy design of async logging in this PR. It is a trade-off, so I don't think it's a good idea to handle all logs in async mode. In practice, we hope people only choose async logging for those logs which may happen at safepoints.

I understand Yasumasa's problem. If you would like to consider netcat or nfs/sshfs, I think your problem can still be solved with the existing file-based output. In this way, you can also utilize this feature by setting your "file" output to async mode; that makes your hotspot non-blocking over TCP as well.

@YaSuenag (Member) commented Mar 30, 2021

This proposal mostly looks good to me, but it would be better if async support were implemented in a higher-level class.
The LogFileOutput class is inherited as follows, and you modified LogFileOutput now (you might change LogFileStreamOutput later):

  • LogOutput
    • LogFileStreamOutput
      • LogFileOutput

I want to add async support to LogFileStreamOutput because it helps us if we add another UL output (e.g. a network socket) in the future.

@tstuefe (Member) commented Mar 30, 2021

Hi Xin, regarding the VM thread blocking on logs.

If you instead use two arrays, one active and one for flushing, you can swap them with atomic stores from the flushing thread.
And use GlobalCounter::write_synchronize(); to make sure no writer is still using the swapped out array for logging.

The logging thread would use GlobalCounter::critical_section_begin(), atomic inc position to get the spot in the array for the log, store the log and then GlobalCounter::critical_section_end().

That way you will never block a logging thread with the flushing and run enqueues in parallel.

If you really want smooth logging you could also remove the strdup, since it may cause a syscall to "break in" more memory.
To solve that you could use the arrays as memory instead, and do bump-the-pointer allocation with an atomic increment by the size needed instead of the position.

+1. This is what I meant with my strdup() critique. Doesn't the Deque also allocate memory for its entries dynamically? If yes, we'd have at least two allocations per log, which I would avoid. I'd really prefer a simple, stupid fixed-size array here (or two; the double buffering Robbin proposed is a nice touch).

As I wrote before, this would make UL also more robust in case we ever want to log low level VM stuff without running into circularities. Ideally, UL should never have relied on VM infrastructure to begin with. That is a design flaw IMHO. UL calling - while logging - into os::malloc makes me deeply uneasy.

I tested a bit locally and generally I don't think there is an issue with blocking the VM thread on flushing.
So I'm not really that concerned about this, but it's always nice to have an algorithm which is constant time instead. (Neither CS begin()/end() nor atomic inc can fail or loop on x86.)
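
To make the double-buffer idea above concrete, here is a sketch that uses std::atomic in place of HotSpot's GlobalCounter/Atomic machinery (that substitution, the slot count and all names are assumptions; the real scheme would wrap log() in GlobalCounter's critical section and call write_synchronize() after the swap):

#include <atomic>

struct LogBuffer {
  static const int SLOTS = 1024;
  const char*      entries[SLOTS];
  std::atomic<int> next{0};
};

class DoubleBuffer {
  LogBuffer               _a, _b;
  std::atomic<LogBuffer*> _active{&_a};

 public:
  // Logging thread: claim a slot in whichever buffer is currently active.
  // In HotSpot this body would run inside a GlobalCounter critical section.
  bool log(const char* msg) {
    LogBuffer* buf = _active.load(std::memory_order_acquire);
    int idx = buf->next.fetch_add(1, std::memory_order_relaxed);
    if (idx >= LogBuffer::SLOTS) return false;  // buffer full: drop the line
    buf->entries[idx] = msg;                    // real code would copy msg
    return true;
  }

  // Flusher thread: swap the buffers; in HotSpot, write_synchronize() would
  // run here before draining the returned buffer, so no logger still sees it.
  LogBuffer* swap() {
    LogBuffer* old = _active.load(std::memory_order_relaxed);
    LogBuffer* fresh = (old == &_a) ? &_b : &_a;
    fresh->next.store(0, std::memory_order_relaxed);
    _active.store(fresh, std::memory_order_release);
    return old;
  }
};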

@simonis (Member) commented Mar 30, 2021

Thanks everybody for your valuable comments. As requested in the PR, I've just started a new discussion thread on hotspot-dev (with all current reviewers on CC).

Before diving into more discussions about implementation details, I'd first like to:

  1. Reach general consensus that asynchronous logging is a useful feature that's worthwhile adding to HotSpot.
  2. Agree on the desired properties of asynchronous logging. I've tried to collect a basic set of desired properties in the "Proposed solution" section of that mail.
  3. Discuss various implementation details and finally prepare new pull requests based on those discussions.

Your comments, suggestions and contributions are highly appreciated.
Thank you and best regards,
Volker

re-introduce the global option AsyncLogging.
@navyxliu (Contributor, Author) commented Apr 3, 2021

Hi Xin, regarding the VM thread blocking on logs.

If you instead use two arrays, one active and one for flushing, you can swap them with atomic stores from the flushing thread.
And use GlobalCounter::write_synchronize(); to make sure no writer is still using the swapped out array for logging.

The logging thread would use GlobalCounter::critical_section_begin(), atomic inc position to get the spot in the array for the log, store the log and then GlobalCounter::critical_section_end().

That way you will never block a logging thread with the flushing and run enqueues in parallel.

If you really want smooth logging you could also remove the strdup, since it may cause a syscall to "break in" more memory.
To solve that you could use the arrays as memory instead, and do bump-the-pointer allocation with an atomic increment by the size needed instead of the position.

I tested a bit locally and generally I don't think there is an issue with blocking the VM thread on flushing.
So I'm not really that concerned about this, but it's always nice to have an algorithm which is constant time instead. (Neither CS begin()/end() nor atomic inc can fail or loop on x86.)

hi, Robbin,

Thanks for your comments. I am new to GlobalCounter, so please correct me if I am wrong.

critical_section_begin/end() reminds me of the Linux kernel's rcu_read_lock/unlock(). Traditionally, the best scenario for RCU is many reads and rare writes, because concurrent writers still need to use atomic operations or a locking mechanism. Unfortunately, all participants in logging are writers; no one is read-only.

The algorithm you described is appealing. The difficulty is that hotspot's GlobalCounter is epoch-based. It's unsafe to swap two buffers until all writers are in a quiescent state, and one write_synchronize() can't guarantee that. Given that log sites are high-frequency and concurrent, you would see multiple versions (epochs) in flight. Further, the most expensive synchronization is the atomic increment; you can't eliminate it.

If we decide to go for a lock-free solution, I think the classic lock-free linked list is the best way to do that. I have seen that hotspot recently checked in a lock-free FIFO. TBH, I would rather have grumpy customers yell at me about async-logging performance first before jumping into lock-free algorithms. Here is why.

  1. It’s easy to implement using a lock, and easy to review as well.
  2. Log sites should not be very contentious. Some are temporally distanced, e.g. class loading and gc marking.
  3. Async logging is an auxiliary feature. Even the entire logging subsystem is neither critical nor mandatory.

I know that I serialize all file-based logs using a mutex. In my defense, it's probably unusual to log to several files in real life. As Volker pointed out earlier, the cost is effectively the same as the existing futex imposed by Unified Logging there[1] if we have only one file output.

I like the idea of swapping 2 buffers. In our internal jdk8u, I do use this approach: the flusher swaps 2 ring buffers after it dumps one[2]. For the linked-list deque, pop_all is essentially swapping 2 linked lists in O(1) complexity. It still uses a mutex, but it pops all elements at once. The amortized cost is low.

I use the following approach to prevent blocked IO from suspending java threads and the vmthread (see the sketch after this list):

  1. Flushing is done in a NonJavaThread, which means it can ignore safepoints.
  2. Release the buffer lock before starting the IO; it won't block anybody from enqueuing.
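
A minimal sketch of the "pop everything under the lock, do the IO outside it" pattern just described (std::mutex and std::list stand in for HotSpot's Mutex and linked-list deque; all names are invented):

#include <cstdio>
#include <list>
#include <mutex>
#include <string>

class Flusher {
  std::list<std::string> _q;
  std::mutex             _lock;

 public:
  void enqueue(std::string msg) {
    std::lock_guard<std::mutex> ml(_lock);
    _q.push_back(std::move(msg));
  }

  // Detach the whole queue in O(1) while holding the lock, then do the
  // (possibly blocking) file IO after the lock has been released, so logging
  // threads are never stuck behind a slow write.
  void flush(std::FILE* out) {
    std::list<std::string> local;
    {
      std::lock_guard<std::mutex> ml(_lock);
      local.splice(local.end(), _q);   // constant-time "pop_all"
    }
    for (const std::string& s : local) {
      std::fputs(s.c_str(), out);
      std::fputc('\n', out);
    }
  }
};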

The main use case would be a synchronous log for everything, but asynchronous log for safepoint and gc.
Recently, I have been thinking about how to get consensus on a feature before beating it to death. Look, there's always room to improve it. I think it's a good idea to focus on the core problem we are solving and sidetrack optimizations to future development.

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/logging/logFileStreamOutput.cpp#L123
[2] https://cr.openjdk.java.net/~xliu/8229517/webrev04/webrev/src/share/vm/utilities/ostream.cpp.udiff.html

@robehn (Contributor) commented Apr 6, 2021

Hi Xin,

From the algorithm perspective all loggers are readers.
We are protecting the buffer pointer which only the flusher writes to.
To make sure no one still sees the old buffer after the store of the swapped-in one, we need to issue a write synchronize.
When the write synchronize finishes, we know that the to-be-flushed buffer is no longer visible.

That the readers of this buffer pointer write into the buffer doesn't matter, since they can only write if they can see the buffer.
When no reader can see the buffer, they cannot write to it either.

I do not follow your reasoning on atomic increment.

  • This change-set has:
    { // critical area MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag); enqueue_impl(m); }
  • A lock-free linked list is most often implemented with CAS.

Both CAS and the above mutex serialization are more expensive than an atomic inc.

If you feel that your current code is battle-proven and you think doing additional enhancements as incremental changes is better, please do so. As I said, I don't have any big concern about either performance or blocking the VM thread.

Xin Liu added 13 commits Apr 9, 2021
Move LogAsyncFlusher from WatcherThread to a standalone NonJavaThread
https://issues.amazon.com/issues/JVM-565
nmethod::print(outputStream* st) should not obtain tty_lock by assuming
st is defaultStream. It could be logStream as well.

Currently, AsyncLogFlusher::_lock has the same rank as tty_lock.
https://issues.amazon.com/issues/JVM-563
…yncLogging"

This reverts commit 81b2a0c.

This problem is sidetracked by JDK-8265102.
Each LogOutput has an independent counter. The out-of-band message
"[number] messages dropped..." will be dumped into its corresponding
LogOutput.
Xin Liu
This patch also supports adding a new output dynamically. If the
output_option specifies async=true, the new output will use
asynchronous writing.

Currently jcmd VM.log prohibits users from changing established
output_options at runtime. Users can disable them all and then
recreate them with the new output_options.
It was used as a timeout parameter of the monitor. Now the monitor is woken up
when the occupancy of the async log buffer is more than 3/4.
LogAsyncFlusher::_lock ranks Mutex::tty on purpose, which is the same as tty_lock.
Ideally, they are orthogonal. In reality, it's possible that a thread emits logs
to a log file while (mistakenly) holding tty_lock. ttyUnlocker is placed in the enqueue
member functions to resolve conflicts between them.

This patch fixed the jtreg test: runtime/logging/RedefineClasses.java and the
full brunt logging -Xlog:'all:file=hotspot.log::async=true'
I saw intermittent crashes of java with the following arguments.
-Xlog:'all=trace:file=hotspot-x.log:level,tags:filecount=0,async=true' --version

The root cause is a race condition between the main thread's _exit and
LogAsyncFlusher::run. This patch adds synchronization using
Terminator_lock in LogAsyncFlusher::terminate. This guarantees that no
log entry will be emitted while the main thread is exiting.
Xin Liu
This change makes sure LogAsyncFlusher::abort() is lock-less.
Therefore, it's not subject to the mutex rank issue. The newly added
gtest (mutex_lock_access_leaf) may deliberately trigger SIGSEGV
while holding an access-rank mutex, so abort() must be lockless.
A lightweight LogDecorationRef is created to keep LogDecorations at
the logging site. It uses a refcount to keep track of multiple uses and
automatically clean up. If no decorator is in use, no LogDecorationRef
is created.
@navyxliu (Contributor, Author) commented Apr 24, 2021

Hi Xin,

From the algorithm perspective all loggers are readers.
We are protecting the buffer pointer which only the flusher writes to.
To make sure no one still sees the old buffer after the store of the swapped-in one, we need to issue a write synchronize.
When the write synchronize finishes, we know that the to-be-flushed buffer is no longer visible.

That the readers of this buffer pointer write into the buffer doesn't matter, since they can only write if they can see the buffer.
When no reader can see the buffer, they cannot write to it either.

I do not follow your reasoning on atomic increment.

* This change-set has:
  `  { // critical area MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag); enqueue_impl(m);   }`

* A lock-free linked list is most often implemented with CAS.

Both CAS and the above mutex serialization are more expensive than an atomic inc.

If you feel that your current code is battle-proven and you think doing additional enhancements as incremental changes is better, please do so. As I said, I don't have any big concern about either performance or blocking the VM thread.

Hi, Robbin,

Now I understand "From the algorithm perspective all loggers are readers". LogOutputList uses a similar idea, and I do leverage it in this PR where obtaining a lock is inappropriate.

GlobalCounter::write_synchronize() is epoch-based RCU for "safe memory reclamation". It's intriguing to use it in a ping-pong buffer, and now I see it's correct! I am very interested in it; let me try it out after this PR. I feel sorry that my reaction time is so long. I have to read and understand a lot of hotspot runtime code, which is not easy for me.

For "the most expensive synchronization is atomic_incrementation." I mean we need to atomic increase the writing pointer of a buffer no matter what. It should be the most expensive "cost". Yes, I acknowledge that atomic operations are cheaper than CAS or mutex.

So far, I am fine with the performance (it's not my intention to write "high-performance" logging), but I feel it's quite cumbersome to use a Mutex given that its ranking is rigid. GlobalCounter + ping-pong buffer should result in a more concise and efficient implementation.

thanks,
--lx

@navyxliu (Contributor, Author) commented Apr 26, 2021

Integration Test

We understand that a feature is useless or even harmful if it's buggy. In addition to the gtest in this PR, we also set up integration tests to verify it.

  1. Full-brunt logging over an entire jvm lifecycle
    The following command dumps all logs in trace verbosity in async logging mode.
    There are 13k messages and none are dropped from [0.004s] to [0.087s].

time ./build/linux-x86_64-server-release/images/jdk/bin/java -Xlog:'all=trace:file=hotspot.log:uptime,tid,l,tg:filecount=0,async=true' --version

$wc -l hotspot.log
13386 hotspot.log
$grep "messages dropped" hotspot.log -n
  2. Gclog test
    We directly use the testcase in the LinkedIn whitepaper[1]. Test.java launches N threads and each one churns the java heap by allocating String objects. We use the default gc (G1GC) of jdk17.

The following command runs for an unlimited time with a 10g heap. It launches 4 threads and dumps gc logs in trace verbosity.

java -Xmx10g -Xms10g -Xlog:gc=trace:file=gc.log::async=true Test -1 4

We have been running it for over a week and no message is dropped.

  3. Gclog test with dropped messages
    The following command still uses the Test.java introduced in the prior test; it dumps all log sets as long as they contain gc.
    IIUC, it simulates -XX:+PrintGCDetails. It still starts 4 threads and the test ends in 180s.

java -Xmx10g -Xms10g -Xlog:'gc*=trace:file=gc-star.log::async=true' Test 180 4

Log rotation works correctly. Given the verbose logging and intense gc activity, we do observe some logs dropped.

In gc-star.log.0, whose timestamps range from [11445.736s] to [11449.168s], it drops 4717 messages out of 130224 in total, i.e. 3.6% of the messages are ditched.

No message is dropped after we enlarge the buffer to -XX:AsyncLogBufferSize=102400.
In debug build, hotspot will dump dropping messages to tty console with -XX:+Verbose.

  4. Gclog test with dynamic reconfiguration
    While running test 3, we can dynamically reconfigure logs using jcmd VM.log.

We use a script to disable all outputs and re-create a file-based output in async mode. We ran the script periodically over a day and no problem was identified.

  5. Gclog test with NMT
    The following command is the same as test 3 plus -XX:NativeMemoryTracking=summary

java -Xmx10g -Xms10g -Xlog:'gc*=trace:file=gc-star.log::async=true' -XX:NativeMemoryTracking=summary Test -1 4

We use a script to monitor the NMT summary. All off-heap allocations are tagged mtLogging. We don't observe any memory leak in the Logging category during long-haul runs.
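
As an aside on where the "messages dropped" lines come from, here is a hypothetical, mutex-based model of the accounting (not the PR's code): the queue capacity plays the role that -XX:AsyncLogBufferSize tunes above, and the flusher turns the overflow counter into a single summary line per batch.

#include <cstddef>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// If the bounded queue is full, the message is discarded and only a counter
// is bumped; the flusher drains the queue and, if the counter is non-zero,
// appends one "N messages dropped..." line.
struct DropCountingQueue {
  std::mutex              _lock;
  std::deque<std::string> _queue;
  std::size_t             _capacity;
  std::size_t             _dropped = 0;

  explicit DropCountingQueue(std::size_t cap) : _capacity(cap) {}

  void enqueue(const std::string& msg) {
    std::lock_guard<std::mutex> g(_lock);
    if (_queue.size() >= _capacity) { _dropped++; return; }  // drop and remember
    _queue.push_back(msg);
  }

  void flush(std::FILE* out) {
    std::deque<std::string> local;
    std::size_t dropped = 0;
    {
      std::lock_guard<std::mutex> g(_lock);
      local.swap(_queue);
      std::swap(dropped, _dropped);
    }
    for (const std::string& m : local) std::fprintf(out, "%s\n", m.c_str());
    if (dropped > 0) std::fprintf(out, "%zu messages dropped...\n", dropped);
  }
};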

We also ran asynchronous logging with Minecraft 1.16.5 on both Linux and macOS. No problem has been observed so far.

[1] https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic

@tstuefe
Member

@tstuefe tstuefe commented Apr 29, 2021

Hi @navyxliu ,

sorry for the quiet time, but good work on the testing!

Had a read through your patch. One thing I dislike is the optimization you did for LogDecorations. I see what you did there, but it makes the code a lot less readable while still not being a perfect solution.

The fact that LogDecorations is implemented with a fixed-size 256-char buffer is not perfect today, even outside the context of async logging. It may still be too small (interestingly, one thing I noticed is that UL does not have standard decorators like "file" and "line", which could exhaust that buffer). At the same time it takes far too much memory on average, especially in a context like yours where you want to store resolved decorations.

I propose to tackle that problem separately though, independently from your patch. One possible solution we have used in the past for similar problems is to delay the string resolution and store the raw data for printing in some kind of vector (similar to a va_arg pointer). But I'd leave this for another day; your patch is complex as it is.

For now I propose one of two things:

  • Either live with the full decorations and pay the 256-chars-per-message price
  • Or just resolve the decorations in the async logger thread and, for now, clearly document it as "works as designed: decorators (e.g. tid) may be off if async logging is engaged"

and in both cases wait for the follow-up RFE to introduce a better way to persist decorations.
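
For illustration, a minimal sketch of the second option, with made-up names (DeferredLogEntry, resolve_and_write); it only shows the idea of carrying raw data and rendering decorations in the flusher thread:

#include <cstdio>
#include <ctime>
#include <string>

// The logging thread stores only the raw ingredients of a message; the async
// logger thread renders the decorations when it flushes.
struct DeferredLogEntry {
  int         level;    // numeric log level captured at the call site
  const char* tags;     // pre-rendered tag string (assumed to live long enough)
  std::string text;     // the already-formatted message body
};

static void resolve_and_write(std::FILE* out, const DeferredLogEntry& e) {
  std::time_t now = std::time(nullptr);       // decoration resolved at flush time
  char ts[32];
  std::strftime(ts, sizeof(ts), "%H:%M:%S", std::localtime(&now));
  std::fprintf(out, "[%s][level %d][%s] %s\n", ts, e.level, e.tags, e.text.c_str());
}

The trade-off is exactly the caveat stated above: decorators such as uptime or tid then describe the flusher, not the thread that produced the message, unless they are captured eagerly at the call site.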

What do you think?

Cheers, Thomas

@navyxliu
Contributor Author

@navyxliu navyxliu commented Apr 30, 2021

At the same time it takes far too much memory on average, especially in a context like yours where you want to store resolved decorations.

Hi, Thomas,

Thank you for reviewing the patch.

I think we should provide accurate log decorations. I am okay with compromising if accuracy comes at a steep cost, but my performance results suggest it does not.

When I made the patch "Accurate Decorations for AsyncLogging", you can see that I deliberately avoided changing existing code. That left me little room to support new features; e.g. I can't allocate a LogDecorations object with the keyword new, because global new/delete are forbidden in HotSpot. That's why I chose a new class, LogDecorationRef. The reason I chose refcounting is that I found LogMessageBuffer reuses the same LogDecorations object while it iterates, so a refcount can avoid duplication.

I made a new revision which just copies LogDecorations into each AsyncLogMessage. My profiling results show that the benefit of the refcounting optimization is very limited.
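
For context, a rough sketch of the two shapes being compared; LogDecorationRefSketch and AsyncLogMessageSketch are stand-ins for the classes discussed here, not the actual PR code, and plain new/delete is used only to keep the sketch standalone (HotSpot would use its own C-heap allocation instead):

#include <atomic>

// Stand-in for a resolved LogDecorations (the real class holds the rendered
// decoration strings in a fixed buffer).
struct DecorationsStub { char text[256]; };

// Earlier "refcnt" revision: several messages created from one
// LogMessageBuffer share a single resolved decoration object through an
// intrusive reference count.
class LogDecorationRefSketch {
  DecorationsStub  _decorations;
  std::atomic<int> _refs{1};
 public:
  explicit LogDecorationRefSketch(const DecorationsStub& d) : _decorations(d) {}
  LogDecorationRefSketch* acquire() { _refs.fetch_add(1, std::memory_order_relaxed); return this; }
  void release() { if (_refs.fetch_sub(1, std::memory_order_acq_rel) == 1) delete this; }
  const DecorationsStub& decorations() const { return _decorations; }
};

// Later "copying" revision measured below: each queued message simply carries
// its own copy of the decorations.
struct AsyncLogMessageSketch {
  DecorationsStub decorations;  // copied once per message
  const char*     text;
};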

Here is my experiment.

  1. baseline: print the version without any logging.

perf stat -r 200 java --version > /dev/null

  2. original: dump all logs at trace verbosity.

perf stat -r 200 java -Xlog:'all=trace:file=hotspot-gold.log:uptime,tid,l,tg:filecount=0' --version

  3. original without decorations: decorators set to none.

perf stat -r 200 java -Xlog:'all=trace:file=hotspot-none.log:uptime,none:filecount=0' --version

  4. async logging with refcnt (the current patch).

perf stat -r 200 java -XX:+AsyncLogging -Xlog:'all=trace:file=hotspot-async-refcnt.log:uptime,tid,l,tg:filecount=0' --version

  5. async logging with copying.

perf stat -r 200 java -XX:+AsyncLogging -Xlog:'all=trace:file=hotspot-async-copying.log:uptime,tid,l,tg:filecount=0' --version

#  configuration    wall-time (ms)
1  baseline         30.928
2  original         98.204
3  original-none    84.624
4  async-refcnt     82.632
5  async-copying    83.71

I compared the generated log files; async logging produces the same output as the original.

original is synchronous logging. async-refcnt shows the best result, 15.8% faster than original; it benefits from a standalone thread. async-copying is my latest patch, which just copies LogDecorations once per message. Its performance is very close, but the implementation is much simpler.

original-none shows that decorations account for only a small fraction of the runtime; the main overhead is still the log messages themselves. Copying the decorations isn't expensive, and the benefit of multi-threading outweighs it. Is it okay if I revert my refcnt version and replace it with the copying solution?

xliu added 2 commits Apr 29, 2021
LogDecorations defines a copy constructor. Each log message copies it once.
The old flag -XX:+AsyncLogging and the output option are both removed.
This patch also delivers the zero-cost promise, i.e. no AsyncLog thread will be
launched if async_mode is not enabled.
@navyxliu
Contributor Author

@navyxliu navyxliu commented May 4, 2021

Hi, Reviewers,

May I ask you to review this PR? I have made it comply with the CSR. A global option -Xlog:async enables async logging.
By default, this option is off, so it is now truly zero-cost by default; HotSpot doesn't even launch the async log thread.

thanks,
--lx

@tstuefe
Member

@tstuefe tstuefe commented May 4, 2021

Hi Xin,

I put #3855 up for review. Please take a look at it. It makes LogDecorations trivially copyable and reduces its size by about a quarter, too. With this patch I believe you can do without your new LogDecorations copy constructor as well as the deleted assignment operator, since the default copy constructor and assignment operator should work fine now.
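
A small illustration of what a trivially copyable LogDecorations buys the async queue; the types below are stand-ins, not the real HotSpot classes:

#include <type_traits>

// With trivially copyable decorations, the queue entry can hold them by value
// and rely on the compiler-generated copy operations.
struct DecorationsStub { char buffer[200]; int resolved_mask; };
static_assert(std::is_trivially_copyable<DecorationsStub>::value,
              "entries can be copied into the queue with plain memcpy semantics");

struct QueueEntrySketch {
  DecorationsStub decorations;  // copied by value, no user-declared copy ctor needed
  const char*     message;
};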

I do not understand why you make the _tagset member in LogDecorations optional. Could you explain this?

Cheers, Thomas

@navyxliu
Contributor Author

@navyxliu navyxliu commented May 5, 2021

I do not understand why you make the _tagset member in LogDecorations optional. Could you explain this?

The reason I changed _tagset from a reference to a pointer is that I would like to support a NULL tagset.
This is a corner case: when I inject meta-information, I don't have an appropriate tagset for the log message.
[(null) ] is what LogDecorations produces with a NULL tagset.

[24.933s][warning][(null)               ]   1983 messages dropped...
[24.943s][warning][(null)               ]   1424 messages dropped...
[24.953s][warning][(null)               ]   2067 messages dropped...
[24.964s][warning][(null)               ]   2466 messages dropped...
[24.972s][warning][(null)               ]   1635 messages dropped...

Here is the relevant code snippet:

bool AsyncLogMapIterator::do_entry(LogFileOutput* output, uintx* counter) {
  // Build decorations for the meta-message; drop the tags decorator because
  // this message has no tagset of its own.
  LogDecorators decorators = output->decorators();
  decorators.without(LogDecorators::tags_decorator);
  LogDecorations decorations(LogLevel::Warning, decorators);
  const int sz = 128;
  char out_of_band[sz];

  // If any messages were dropped for this output, emit a single summary line.
  if (*counter > 0) {
    jio_snprintf(out_of_band, sz, UINTX_FORMAT_W(6) " messages dropped...", *counter);
    output->write_blocking(decorations, out_of_band);
    *counter = 0;
  }

  return true;
}
@tstuefe
Member

@tstuefe tstuefe commented May 5, 2021

Ah, I get it. I thought LogTagSet was a bitmask... why is it not a bitmask? Again we pay for a full array here, plus an arbitrary limitation to 5 tags.

Well, I may not know the full story.

I see what you mean and this makes sense, but I would prefer to modify LogTagSet so it can be empty and have zero tags. Modifying LogDecorations to deal with something that should be handled in LogTagSet just perpetuates the complexity.

Alternatively, you could just add a placeholder tag for "this set has no tags" and we leave the LogTagSet improvement for later.
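
To illustrate the placeholder idea with a tiny self-contained sketch (this is not HotSpot's LogTagSet API, and the "async" placeholder tag is made up): a meta-message carrying zero tags is printed with a fixed placeholder instead of "(null)".

#include <cstddef>
#include <cstdio>

// Render a comma-separated tag list; an empty list gets a placeholder tag.
static void format_tags(char* out, std::size_t cap,
                        const char* const* tags, int ntags) {
  if (ntags == 0) {                            // "this set has no tags"
    std::snprintf(out, cap, "%s", "async");    // hypothetical placeholder tag
    return;
  }
  std::size_t used = 0;
  for (int i = 0; i < ntags && used < cap; i++) {
    int n = std::snprintf(out + used, cap - used, "%s%s", i ? "," : "", tags[i]);
    if (n < 0) break;
    used += static_cast<std::size_t>(n);
  }
}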
