8229517: Support for optional asynchronous/buffered logging #3135
This patch provides a buffer to store asynchronous log messages and flush them to the underlying file-based outputs.
Add an async option for the file-based outputs. The option conforms to the output-options of JEP 158.
Fix a warning from C++11: "The generation of the implicitly-defined copy assignment operator is deprecated if T has a user-declared destructor or user-declared copy constructor."
I would like to restart the RFR process for the async logging feature. We (AWS) have deployed this feature for over a year in a few critical services. It helps us reduce long-tail GC pauses. On Linux, we used to experience intermittent delays on the order of seconds due to GC log writes. If those undesirable events happen to occur at safepoints, HotSpot has to prolong the pause, which then increases the response time of the Java application/service.
Originally, we observed and solved this issue on a Linux system with software RAID. In the absence of hardware assistance, multiple writes have to be synchronized, and it is that operation that yields long pauses. This issue may become more prevalent if Linux servers adopt ZFS in the future. We don't think redirecting log files to tmpfs is a final solution; HotSpot should provide a self-contained and cross-platform solution. Ours is to provide a buffer and flush it in a standalone thread periodically.
Since then, we have found more unexpected but interesting scenarios, e.g. some cloud-based applications run entirely on an AWS EBS partition, where even the write syscall can stall.
Those pain points are not AWS-exclusive. We found relevant questions on Stack Overflow, and it seems that J9 provides an option for this.
Back to the implementation: this is the 2nd revision, based on Unified Logging. The previous RFR was a top-down design; we provided a parallel header file
May we know more about LogMessageBuffer.hpp/cpp? We haven't found a real use of it, which is why we are hesitant to support LogFileOutput::write(LogMessageBuffer::Iterator msg_iterator). Further, we haven't supported async_mode for LogStdoutOutput and LogStderrOutput either. It's not difficult, but it needs a big code change.
YaSuenag left a comment
I think this PR is very useful for us!
Thank you for providing the stacktrace! I didn't notice <logMessage.hpp> until you pointed it out. Now I understand the rationale and use cases of logMessageBuffer. Let me figure out how to support it.
IIUC, the most important attribute of
I skimmed over the patch, but have a number of high-level questions - things which were not clear from your description.
Update: Okay, I see you use PeriodicTask and the WatcherThread. Is this really enough? I would be concerned that it either runs too rarely to swallow all output, or runs so often that it monopolizes the WatcherThread.
I actually expected a separate Thread - or multiple, one per output - for this, waking up when there is something to write. That would also be more efficient than constant periodic polling.
I think this feature could be useful. I am a bit concerned with the increased complexity this brings. UL is already a very (I think unnecessarily) complex codebase. Maybe we should try to reduce its complexity first before adding new features to it. This is just my opinion; let's see what others think.
This is flushed by the watcher thread (non-JavaThread).
We can also change log-configuration during run-time, e.g. turn on/off logs via jcmd.
Thank you for reviewing this PR.
The WatcherThread eventually flushes those buffered messages. If the writing stalls, it blocks the periodic tasks.
The capacity of the buffer is limited and controlled by a JVM option.
If the buffer overflows, it starts dropping messages from the head; this behavior simulates a ring buffer.
I prefer dropping messages to letting the buffer grow unboundedly, because the latter may trigger an out-of-memory error.
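For illustration only, here is the drop-head behavior in plain, standard C++ (BoundedLogQueue and its members are made-up names, not the patch's actual types):

```c++
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// Illustrative bounded queue: when full, drop the oldest entry instead of
// blocking or growing, which approximates a ring buffer.
class BoundedLogQueue {
  std::deque<std::string> _q;
  const size_t _capacity;
  size_t _dropped = 0;            // messages discarded at the head
  std::mutex _mutex;
 public:
  explicit BoundedLogQueue(size_t capacity) : _capacity(capacity) {}

  void enqueue(std::string msg) {
    std::lock_guard<std::mutex> lock(_mutex);
    if (_q.size() >= _capacity) {
      _q.pop_front();             // drop the head; never block the logger
      _dropped++;
    }
    _q.push_back(std::move(msg));
  }
};
```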
The interval is defined by
I have tuned the parameters so that it won't drop messages easily for normal GC activity at info verbosity.
Yes, it works if you have multiple outputs.
So far, LogAsyncFlusher remains active as a periodic task even when no output is in async_mode.
Your concern is reasonable. I don't understand why there is only one WatcherThread while up to 10 periodic tasks are crowded into it.
Can we treat it as a separate task? For normal usage, I think the delay is quite manageable: writing thousands of lines to a file can usually be done in sub-millisecond time.
IMHO, logging shouldn't hurt performance much. At the very least, options that do impact performance are not supposed to be enabled by default. On the other hand, I hope log messages from other threads avoid interleaving when I enable them, so they remain readable.
My design target is non-blocking: pop_all() is an ad-hoc operation which pops all elements and releases the mutex immediately; writeback() does the I/O without holding it.
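As a sketch of that contract (standard-library stand-ins for the HotSpot mutex and deque; pop_all() and writeback() are the names mentioned above, everything else is assumed):

```c++
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>

class AsyncFlusherSketch {
  std::deque<std::string> _buffer;
  std::mutex _mutex;

  // Drain the whole buffer in O(1) by swapping the underlying deque;
  // the mutex is held only for the swap.
  std::deque<std::string> pop_all() {
    std::deque<std::string> drained;
    {
      std::lock_guard<std::mutex> lock(_mutex);
      drained.swap(_buffer);
    }
    return drained;
  }

 public:
  // The I/O happens entirely outside the critical section, so a stalled
  // write never blocks the logging threads.
  void writeback(FILE* out) {
    std::deque<std::string> drained = pop_all();
    for (const std::string& line : drained) {
      fputs(line.c_str(), out);
    }
    fflush(out);
  }
};
```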
In our real applications, we haven't seen this feature degrade GC performance yet.
I believe UL has its own reasons. In my defense, I don't make UL more complex: I only changed a couple of lines in one of its implementation files (logFileOutput.cpp) and didn't change its interfaces.
tstuefe left a comment
Thank you for your detailed answers.
As I wrote, I think this is a useful change. A prior design discussion with a rough sketch would have made things easier. Also, it would have been good to have the CSR discussion beforehand, since it affects how complex the implementation needs to be. I don't know whether there had been design discussions beforehand; if I missed them, I apologize.
I am keenly aware that design discussions often lead nowhere because no-one answers. So I understand why you started with a patch.
About your proposal:
I do not think it can be made airtight, and I think that is okay - if we work with a limited flush buffer and we log too much, things will get dropped, that is unavoidable. But it has to be reliable and comprehensible after the fact.
As you write, the patch you propose works well at AWS, but I suspect that is an environment with limited variables, and use of the VM outside could be much more diverse. We must make sure to roll out only well-designed solutions which work for us all.
E.g. a log system which randomly omits log entries because some internal buffer is full without giving any indication in the log itself is a terrible idea :). Since log files are a cornerstone for our support, I am interested in a good solution.
First off, the CSR:
Do we really need that much freedom? How probable is it that someone wants different async options for different trace sinks? The more freedom we have here, the more complex the implementation gets, and all that stuff has to be tested. Why not just make "async" a global setting?
The use of the WatcherThread and PeriodicTask: polling is plain inefficient, besides the concerns Robbin voiced about blocking things. This is a typical producer-consumer problem, and I would implement it using a dedicated flusher thread and a monitor. The flusher thread should wake only if there is something to write. This is something I would not do in a separate RFE but now; it would also disarm any arguments against blocking the WatcherThread.
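The producer/consumer shape being suggested might look roughly like this (a minimal sketch with std:: types standing in for HotSpot's Monitor and thread classes; all names are made up):

```c++
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

class FlusherThreadSketch {
  std::deque<std::string> _queue;
  std::mutex _mutex;
  std::condition_variable _cv;
  bool _should_terminate = false;
  std::thread _thread;

  void run() {
    std::unique_lock<std::mutex> lock(_mutex);
    while (!_should_terminate) {
      // Block until a producer notifies us; no periodic polling.
      _cv.wait(lock, [this] { return !_queue.empty() || _should_terminate; });
      std::deque<std::string> drained;
      drained.swap(_queue);
      lock.unlock();
      write_all(drained);          // do the I/O without holding the lock
      lock.lock();
    }
  }

  static void write_all(const std::deque<std::string>& lines) {
    // ... write the drained lines to the underlying output ...
  }

 public:
  FlusherThreadSketch() : _thread(&FlusherThreadSketch::run, this) {}

  ~FlusherThreadSketch() {
    { std::lock_guard<std::mutex> lock(_mutex); _should_terminate = true; }
    _cv.notify_one();
    _thread.join();
  }

  void enqueue(std::string msg) {
    { std::lock_guard<std::mutex> lock(_mutex); _queue.push_back(std::move(msg)); }
    _cv.notify_one();              // wake the flusher only when there is work
  }
};
```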
The fact that every log message gets strdup'ed could be handled better. This can be left for a future RFE - but it explains why I dislike "AsyncLogBufferSize" being a number of entries instead of a memory size.
I think processing a memory-size AsyncLogBufferSize can be kept simple: it would be okay to just guess an average log line length and go with that, let's say 256 chars. An AsyncLogBufferSize=1M could thus be translated to 4096 entries in your solution. If the sum of all 4096 allocated lines overshoots 1M from time to time, well, so be it.
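The suggested translation is a single division; a sketch under those assumptions (the 256-char average is the guess from above; nothing here is from the patch):

```c++
// Translate a memory-size flag into an entry budget using a guessed
// average line length. Occasional overshoot is tolerated.
const size_t assumed_avg_line_len = 256;
size_t entry_count = AsyncLogBufferSize / assumed_avg_line_len;  // 1M / 256 = 4096
```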
A future, better solution could use a preallocated fixed-size buffer. There are two ways to do this: the naive but memory-inefficient way - an array of fixed-size text slots, like the event system does - and a smart way: a ring buffer of variable-sized strings, '\0'-separated, laid out one after the other in memory. The latter is a bit more involved, but can be done, and it would be fast and very memory-efficient. But as I wrote, this is an optimization which can be postponed.
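A rough sketch of the "smart" variant, under simplifying assumptions (single-threaded access, every string shorter than the capacity; purely illustrative, not a proposal for the patch):

```c++
#include <cstddef>
#include <cstring>

// A ring of '\0'-separated variable-sized strings in one preallocated block.
class StringRingSketch {
  char*  _buf;
  size_t _cap;
  size_t _head = 0;   // index of the oldest byte
  size_t _tail = 0;   // one past the newest byte
  size_t _used = 0;

  void drop_oldest() {             // advance head past one '\0'-terminated entry
    while (_used > 0) {
      char c = _buf[_head];
      _head = (_head + 1) % _cap;
      _used--;
      if (c == '\0') break;
    }
  }

 public:
  StringRingSketch(char* buf, size_t cap) : _buf(buf), _cap(cap) {}

  void push(const char* s) {
    size_t len = strlen(s) + 1;    // store the terminating '\0' too
    while (_cap - _used < len) {
      drop_oldest();               // overwrite the oldest entries when full
    }
    for (size_t i = 0; i < len; i++) {
      _buf[_tail] = s[i];
      _tail = (_tail + 1) % _cap;
    }
    _used += len;
  }
};
```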
I may misunderstand the patch, but do you resolve decorators when the flusher is printing? Would this not distort time-dependent decorators (timemillis, timenanos, uptime, etc.), since we would record the time of printing, not the time of logging?
If yes, it may be better to resolve the message early and just store the plain string and print that. Basically, this would mean moving the whole buffering down a layer or two, right to where the raw strings get printed. This would be vastly simplified if we abandoned the "async for every trace sink" notion in favor of just a global flag.
This would also save a bit of space, since we would not have to carry all the meta information in
Please find further remarks inline.
If the flusher blocks, could this block VM shutdown? Would this be different from what we do now, e.g. since all log output is serialized and done by one thread? It's probably fine, but we should think about this.
The question was how we handle multiple trace sinks, see my "CSR" remarks.
Since you use a mutex it introduces synchronization, however short, across all logging threads. So it influences runtime behavior. For the record, I think this is okay; maybe a future RFE could improve this with a lockless algorithm. I just wanted to know if you measured anything, and I was curious whether there is a difference now between synchronous and asynchronous logging.
(Funnily, asynchronous logging is really more synchronous in a sense, since it synchronizes all logging threads across a common resource).
I understand. It's fine to do this in a later RFE.
LogMessage supports async_mode; remove the option AsyncLogging; rename the option GCLogBufferSize to AsyncLogBufferSize; move drop_log() to LogAsyncFlusher.
On 27/03/2021 5:30 pm, Thomas Stuefe wrote:
IMO the discussions last year left it still an open question whether
I'm piggy-backing on some of Thomas's comments below.
Truly global or global for all actual file-based logging? I think
I'm not sure it should be a bounded size at all. I don't like the idea
The logging interval should be configurable IMO, so it either needs a
I agree with Thomas here. Using the WatcherThread for this is not really
If we had had async logging from day one then the way we construct log
As I started with, I think there needs to be a return to the high level
Note that I am really in favor of bringing async logging to UL; this issue popped up again and again, brought in various forms by various people. It will be good to finally tackle this.
But I agree that talking about the design first would be helpful. Maybe have a little mailing list thread to stop polluting this PR?
I posted a similar discussion to hotspot-runtime-dev last November. It aims to implement sending UL output via a network socket. I believe this PR helps with it.
Interesting. This design diagram is similar to this PR's, but I don't think it is a good idea to have a blocking message buffer.
Design and its Rationale
For the async logging feature, we propose a lossy, non-blocking design here. A bounded deque or ring buffer gives a strong guarantee that log sites won't block Java threads or the critical internal threads. This is the very problem we mean to solve.
It can be proven that we cannot have all three guarantees at the same time: non-blocking behavior, bounded memory, and log fidelity. To overcome blocking I/O, which sometimes is not under our control, we think it's fair to trade log fidelity for non-blocking behavior. If we kept fidelity and chose an unbounded buffer, we could end up with spooky out-of-memory errors on resource-constrained hardware. We understand that the platforms HotSpot runs on range from powerful servers to embedded devices; by leaving the buffer size adjustable, we can fit more scenarios. Nevertheless, with a bounded buffer, we believe developers can still capture the important logging traits as long as the window is big enough and the log messages are consecutive. The current implementation does provide those two merits.
A new proposal based on current implementation
I agree with reviewers' comments above. It's questionable to use the singleton
Just like Yasumasa depicted, I can create a dedicated NonJavaThread to flush logs instead. Yesterday, I found
Wrap it up
We would like to propose a lossy design for async logging in this PR. It is a trade-off, so I don't think it's a good idea to handle all logs in async mode. In practice, we hope people only choose
I understand Yasumasa's problem. If you would consider netcat or nfs/sshfs, I think your problem can still be solved with the existing file-based output. That way you can also utilize this feature: set your "file" output to async mode, and it makes your HotSpot non-blocking over TCP as well.
This proposal mostly looks good to me, but it would be better if async support were implemented in a higher-level class.
I want to add async support to LogFileStreamOutput or LogOutput because it would help us if we add other UL outputs (e.g. a network socket) in the future.
+1. This is what I meant with my strdup() critique. Doesn't the deque also allocate memory for its entries dynamically? If yes, we'd have at least two allocations per log message, which I would avoid. I'd really prefer a simple, stupid fixed-size array here (or two - the double buffering Robbin proposed is a nice touch).
As I wrote before, this would make UL also more robust in case we ever want to log low level VM stuff without running into circularities. Ideally, UL should never have relied on VM infrastructure to begin with. That is a design flaw IMHO. UL calling - while logging - into os::malloc makes me deeply uneasy.
Thanks everybody for your valuable comments. As requested in the PR, I've just started a new discussion thread on hotspot-dev (with all current reviewers on CC).
Before diving into more discussions about implementation details, I'd first like to:
Your comments, suggestions and contributions are highly appreciated.
Thanks for your comments. I am new to GlobalCounter, so please correct me if I am wrong.
critical_section_begin/end() reminds me of the Linux kernel's rcu_read_lock()/rcu_read_unlock(). Traditionally, the best scenario for RCU is many reads and rare writes, because concurrent writers still need atomic operations or a locking mechanism. Unfortunately, all participants in logging are writers; no one is read-only.
The algorithm you described is appealing. The difficulty is that HotSpot's GlobalCounter is epoch-based: it's unsafe to swap the two buffers until all writers are in a quiescent state. One
If we decide to go with a lock-free solution, I think the classic lock-free linked list is the best way to do it. I have seen that HotSpot recently checked in a lock-free FIFO. TBH, I would rather have grumpy customers yell at me about async-logging performance before jumping into lock-free algorithms. Here is why.
I know that I serialize all file-based logs using a mutex. In my defense, it's probably unusual to log to several files in real life. As Volker pointed out earlier, if we have only one file output, the cost is effectively the same as that of the existing futex imposed by Unified Logging.
I like the idea of swapping two buffers. In our internal JDK 8u, I do use this approach: the flusher swaps the two ring buffers after it dumps one. For the linked-list deque, pop_all() is essentially swapping two linked lists in O(1). It still uses a mutex, but it pops all elements at once, so the amortized cost is low.
I use the following approach to prevent blocked I/O from suspending Java threads and the VMThread.
The main use case would be synchronous logging for everything, but asynchronous logging for safepoint and gc.
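With the per-output option proposed in this PR, that use case might be expressed along these lines (a hypothetical command line assembled from the option syntax quoted elsewhere in this thread):

```
# gc and safepoint messages go to an async file output; everything else
# keeps the default synchronous behavior. (Hypothetical example.)
java -Xlog:'gc*=info,safepoint=info:file=gc.log:uptime,tags:async=true' MyApp
```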
From the algorithm perspective all loggers are readers.
That the readers of this buffer pointer write into the buffer doesn't matter, since they can only write if they can see the buffer.
I do not follow your reasoning on atomic increment.
Both CAS and the mutex serialization above are more expensive than an atomic increment.
If you feel that your current code is battle-proven and you think doing additional enhancements as incremental changes is better, please do so. As I said, I don't have any big concerns about either performance or blocking the VM thread.
Move LogAsyncFlusher from WatcherThread to a standalone NonJavaThread https://issues.amazon.com/issues/JVM-565
nmethod::print(outputStream* st) should not obtain tty_lock by assuming st is defaultStream; it could be a LogStream as well. Currently, LogAsyncFlusher::_lock has the same rank as tty_lock. https://issues.amazon.com/issues/JVM-563
…yncLogging" This reverts commit 81b2a0c. This problem is sidetracked by JDK-8265102.
Each LogOutput has an independent counter. The out-of-band message "[number] messages dropped..." will be dumped into its corresponding LogOutput.
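A minimal sketch of that accounting (illustrative names only; in the patch, the counter lives with the output/flusher code):

```c++
#include <cstddef>
#include <cstdio>

// On each flush, emit one out-of-band line summarizing what this output
// lost since the previous flush, then reset its private counter.
void flush_one_output(FILE* out, size_t* dropped) {
  if (*dropped > 0) {
    fprintf(out, "[%zu] messages dropped...\n", *dropped);
    *dropped = 0;
  }
  // ... then write the buffered messages belonging to this output ...
}
```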
This patch also supports adding a new output dynamically. If output_option specifies async=true, the new output will use asynchronous writing. Currently, jcmd VM.log prohibits users from changing an established output's output_options at runtime, but users can disable all outputs and then recreate them with the new output_options.
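For example, a sequence along these lines (hypothetical pid and options; the VM.log parameters follow JEP 158):

```
jcmd <pid> VM.log disable
jcmd <pid> VM.log output="file=hotspot.log" what="all=info" output_options="async=true"
```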
LogAsyncFlusher::_lock ranks Mutex::tty on purpose, which is the same rank as tty_lock. Ideally, the two are orthogonal; in reality, it's possible that a thread emits logs to a log file while (mistakenly) holding tty_lock. A ttyUnlocker is placed in the enqueue member functions to resolve conflicts between them. This patch fixed the jtreg test runtime/logging/RedefineClasses.java and the full-brunt logging -Xlog:'all:file=hotspot.log::async=true'.
I saw intermittent crashes of java with the following arguments: -Xlog:'all=trace:file=hotspot-x.log:level,tags:filecount=0,async=true' --version. The root cause is a race condition between the main thread's _exit and LogAsyncFlusher::run. This patch adds synchronization using Terminator_lock in LogAsyncFlusher::terminate, which guarantees that no log entry is emitted while the main thread is exiting.
Now I understand "From the algorithm perspective all loggers are readers".
For "the most expensive synchronization is atomic_incrementation." I mean we need to atomic increase the writing pointer of a buffer no matter what. It should be the most expensive "cost". Yes, I acknowledge that atomic operations are cheaper than CAS or mutex.
So far, I am fine with the performance (it's not my intention to write "high-performance" logging), but I feel it's quite cumbersome to use Mutex given that its ranking is stiff. GlobalCounter plus a ping-pong buffer should result in a more concise and efficient implementation.
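For concreteness, a rough sketch of the GlobalCounter-plus-ping-pong-buffer idea (hedged: HotSpot API usage as I understand it, not compiled; Buffer is a hypothetical type):

```c++
// Writers append inside a GlobalCounter critical section; the flusher
// swaps the active buffer, then waits until no writer can still see the
// old one before doing the blocking I/O.
static Buffer  _buffers[2];                  // hypothetical buffer type
static Buffer* volatile _active = &_buffers[0];

void writer_enqueue(const char* msg) {
  GlobalCounter::CriticalSection cs(Thread::current());
  Buffer* buf = Atomic::load_acquire(&_active);
  buf->append(msg);                          // bumps its write index atomically
}

void flusher_flush() {
  Buffer* full  = Atomic::load(&_active);
  Buffer* spare = (full == &_buffers[0]) ? &_buffers[1] : &_buffers[0];
  Atomic::release_store(&_active, spare);
  GlobalCounter::write_synchronize();        // all writers have left 'full'
  full->write_to_file_and_reset();           // safe to flush and reuse
}
```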
We understand that a feature is useless or even harmful if it's buggy. In addition to the gtest in this PR, we also set up integration tests to verify it.
The following command runs for an unbounded time with a 10g heap. It launches 4 threads and dumps GC logs at trace verbosity.
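The exact command is not preserved here; a hypothetical reconstruction along these lines (class name, heap flags, and rotation settings are assumed; the gc-star.log name matches the rotated file mentioned below):

```
java -Xmx10g -Xms10g \
     -Xlog:'gc*=trace:file=gc-star.log:uptime,tags:filecount=5,async=true' \
     GCLogStressTest 4   # hypothetical test class launching 4 worker threads
```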
We have been running it for over a week and no message has been dropped.
Log rotation works correctly. Given the verbose logs and intense GC activity, we do observe some logs being dropped.
In gc-star.log.0, whose timestamps span [11445.736s] to [11449.168s], 4717 messages out of 130224 were dropped in total, i.e. 3.6% of messages were ditched.
No message is dropped after we enlarge the buffer with -XX:AsyncLogBufferSize=102400.
We use a script to disable all outputs and re-create a file-based output in async mode. We ran the script periodically over a day and no problem was identified.
We use a script to monitor the NMT summary. All off-heap allocations are marked mtLogging. We don't observe a memory leak in the Logging category in long-haul runs.
We also ran asynchronous logging in Minecraft 1.16.5 on both Linux and macOS. No problem has been captured yet.
Hi @navyxliu,
sorry for the quiet time, but good work on the testing!
Had a read through your patch. One thing I dislike is the optimization you did for LogDecorations. I see what you did there, but it makes the code a lot less readable while not being a perfect solution.
The fact that LogDecorations is implemented with a fixed-size 256-char buffer is not perfect today, even outside the context of async logging. It may still be too small (interestingly, one thing I noticed is that UL does not have standard decorators like "file" and "line", which could exhaust that buffer), while still taking much too much memory on average - especially in a context like yours, where you want to store resolved decorations.
I propose to tackle that problem separately, though, independently of your patch. One possible solution we have used in the past for similar problems is to delay the string resolution and store the raw data for printing in some kind of vector (similar to a va_arg pointer). But I'd leave this for another day; your patch is complex as it is.
For now I propose one of two things:
and in both cases wait for the follow-up RFE to introduce a better way to persist decorations.
What do you think?
Thank you for reviewing the patch.
I think we should provide accurate log decorations. I am okay with compromising if accuracy comes at a steep expense, but my performance results suggest it does not.
When I made the patch "Accurate Decorations for AsyncLogging", I deliberately avoided changing existing code. That left little room for me to support new features, e.g. I couldn't allocate a LogDecorations object using the keyword new.
I made a new revision which just copies LogDecorations for each AsyncLogMessage. My profiling results show that the benefit of the refcount optimization is very limited.
Here is my experiment.
Comparing the generated log files, async logging generates the same result as the original.
LogDecorations defines a copy constructor; each log message copies it once.
May I ask you to review this PR? I have made it comply with the CSR. A global flag
I put #3855 up for review. Please take a look at it. It makes LogDecorations trivially copyable, and it reduces its size by about a quarter, too. With this patch, I believe you can do without your new LogDecorations copy constructor as well as the deleted assignment operator, since the default copy constructor and assignment operator should work fine now.
I do not understand why you make the _tagset member in LogDecorations optional. Could you explain this?
The reason I changed _tagset from a reference to a pointer is that I would like to support a NULL tagset.
Here is the relevant code snippet.
Ah, I get it. I thought LogTagSet was a bitmask... why would it not be a bitmask? Again we pay for a full array here, plus an arbitrary limit of 5 tags.
Well, I may not know the full story.
I see what you mean and this makes sense, but I would prefer to modify LogTagSet to allow it to be empty, with zero tags. Modifying LogDecorations to deal with something which should be handled in LogTagSet just perpetuates the complexity.
Alternatively, you could just add a placeholder tag for "this set has no tags" and we leave the LogTagSet improvement for later.