JDK-8299415: Saturate the async log buffer more often #11804
Conversation
👋 Welcome back jsjolen! A progress list of the required criteria for merging this PR into
@jdksjolen The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
This is probably fairly weak proof of this, and a day later I'm not so sure that this is true :-). Still, the change is good imho!
I think your change of fraction_used() is reasonable. We don't need to notify the flush thread at a very early stage. As long as 0.3 doesn't cause overflow for
```diff
@@ -174,7 +176,7 @@ void AsyncLogWriter::run() {
     AsyncLogLocker locker;

     while (!_data_available) {
-      _lock.wait(0/* no timeout */);
+      _lock.wait(1000);
```
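For illustration only, here is a simplified sketch of the timed-wait pattern used above, written in plain C++ with hypothetical names rather than the actual HotSpot AsyncLogWriter/AsyncLogLocker code. The 1000 ms value mirrors the timeout in the diff: the flush thread wakes either when a producer signals that data should be flushed, or when the timeout expires, so stragglers are still written out within about a second.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the flush thread's wait loop (not HotSpot code).
class FlushSignal {
  std::mutex _mutex;
  std::condition_variable _cond;
  bool _data_available = false;

 public:
  void notify_data_available() {
    std::lock_guard<std::mutex> guard(_mutex);
    _data_available = true;
    _cond.notify_one();
  }

  // Returns true if woken by a notification, false if the timeout expired.
  bool wait_for_data(std::chrono::milliseconds timeout = std::chrono::milliseconds(1000)) {
    std::unique_lock<std::mutex> lock(_mutex);
    bool notified = _cond.wait_for(lock, timeout, [this] { return _data_available; });
    _data_available = false;  // consume the signal either way
    return notified;
  }
};
```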
Does this increase context switches? In particular, HotSpot may not enable unified logging at all. Would the flush thread then wake up every second for nothing?
It seemed reasonable that waiting 1 second for your sole log message is acceptable
Could you elaborate on this explanation?
I understand that we somehow need to flush the queue if the log stream has ended and fraction_used() is still lower than 0.3. I think we can remember a historic fraction_used. As long as this value increases, it suggests that logs are still trickling in, so we don't need to worry about it.
Using this algorithm, I think we could use an even bigger threshold. The reasonable threshold depends on the speed of IO, which in reality is not known until runtime. I think we can make it adaptive in the future: if we detect overflow, we lower the threshold down to 0.0.
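A rough sketch of that adaptive idea, with hypothetical names and structure (not part of this patch, only an illustration of the proposal above): remember the last observed fraction, and fall back to a threshold of 0.0 once an overflow has been detected.

```cpp
// Hypothetical sketch of an adaptive notification threshold: start at 0.3 and
// drop to 0.0 (notify on every message) once the buffer has ever overflowed.
class AdaptiveFlushPolicy {
  double _threshold = 0.3;      // assumed initial value, matching this patch
  double _last_fraction = 0.0;  // remembered "historic" fraction_used()

 public:
  void on_overflow() {
    // IO turned out to be too slow for the current threshold; be conservative.
    _threshold = 0.0;
  }

  bool should_notify(double fraction_used) {
    bool rising = fraction_used > _last_fraction;  // logs still trickling in
    _last_fraction = fraction_used;
    // Notify when the buffer has filled past the threshold, or when it has
    // stopped growing (so a stream that has gone quiet still gets flushed).
    return fraction_used > _threshold || !rising;
  }
};
```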
Yes, it increases context switching. If HotSpot hasn't enabled
The reasoning for the timeout was twofold:
Let's say that a historic fraction is used; where would the comparison be placed, then? The only place I can think of is the
It's reasonable for this RFE to lower the threshold directly down to 0.0 if overflow is detected. I agree that it's a good idea for a future RFE to implement something more dynamic.
LGTM. I am not a reviewer; we still need other reviewers to approve it.
hi, @jdksjolen
Fair enough. One corner case is that users declare '-Xlog:async' but don't provide any '-Xlog:' selections. They may postpone selecting logs until runtime using dcmd. We can ignore that because the compromise is acceptable.
I agree. No matter what, we can't put the flush thread to sleep without a timeout argument; otherwise we couldn't flush pending messages until termination. The fixed 1s timeout isn't perfect, but I think it's reasonable. Your patch looks good to me.
There is obviously a spectrum of changes possible based on how big a buffer is used and how often it is flushed - can the user already control these two things? If yes then I don't see a justification to change the current behaviour. If no then surely that is what is needed?
hi, @dholmes-ora The current implementation lacks a parameter for "how often it is flushed". Log sites notify the flush thread for every single message, which is the most conservative approach. The general idea is to flush as many messages as possible each time, which amortizes the cost of synchronization and external IO. I think this patch moves toward that goal.
The premise of this patch was "the buffer is being under-utilised", so it delays flushing so that the buffer gets more full. This seems an arbitrary selection - we could have made the buffer smaller instead. I'm not seeing a rationale for the current change that justifies changing the behaviour for all users in a way that they cannot compensate for. If buffer utilisation is a metric of interest (and I'm not convinced it is) then we should be providing a way to control the buffer size (already there) and the flush rate.
Hi David,
I was surprised to hear that your intuition regarding this went against my own, so that caused me to re-do the measurements, and I realised I had made a big mistake: I ran the tests without optimisations (slowdebug). I then decided to re-run the measurements with an optimised build (no
The results of those measurements showed that
With that in mind, maybe what should be looked into isn't this patch, but rather replacing the buffer with a lock-free circular one (for example; maybe there are better fits). That is, if we want better performance.
@navyxliu , maybe we should avoid merging this patch, as it leads to increased complexity while there are other measures that could give much bigger gains?
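As a point of reference only, here is a minimal single-producer/single-consumer ring buffer sketch in standard C++. This is not a proposal for the actual replacement - a real async-logging buffer is multi-producer (many logging threads) and needs an overflow policy - it only illustrates the general shape of the lock-free circular approach mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <utility>

// Minimal SPSC lock-free ring buffer sketch (illustrative only).
template <size_t N>
class SpscRing {
  static_assert((N & (N - 1)) == 0, "N must be a power of two");
  std::string _slots[N];
  std::atomic<size_t> _head{0};  // advanced only by the consumer
  std::atomic<size_t> _tail{0};  // advanced only by the producer

 public:
  bool try_push(std::string msg) {
    size_t tail = _tail.load(std::memory_order_relaxed);
    if (tail - _head.load(std::memory_order_acquire) == N) {
      return false;  // full: the caller decides whether to drop or block
    }
    _slots[tail & (N - 1)] = std::move(msg);
    _tail.store(tail + 1, std::memory_order_release);
    return true;
  }

  bool try_pop(std::string& out) {
    size_t head = _head.load(std::memory_order_relaxed);
    if (head == _tail.load(std::memory_order_acquire)) {
      return false;  // empty
    }
    out = std::move(_slots[head & (N - 1)]);
    _head.store(head + 1, std::memory_order_release);
    return true;
  }
};
```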
hi, @jdksjolen
I still have a task in my backlog. I plan to compress
Yes, a lock-free circular buffer is better, but I am not sure we have a scenario like the Linux kernel's printk, which is always highly contended across kernel threads. If we ignore the extreme case such as
thanks,
Hi,
This change introduces a heuristic in order to ensure that the asynchronous logging buffer is saturated more often. Before this change the logger often wrote single messages to the output; now it outputs significantly more data at a time. I also believe that fewer uses of notify() mean that the performance impact of using asynchronous logging is reduced. Running perf stat consistently shows a smaller number of context switches when run with this change, which supports this hypothesis.

The asynchronous logging tests were run and passed with:
The choice of a ratio of 30% and a timeout of 1 second isn't based on data. It seemed reasonable that waiting 1 second for your sole log message is acceptable and that 30% is enough to not cause any message drops.
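A simplified sketch of the enqueue-side heuristic described above, using hypothetical standard-C++ types rather than the real AsyncLogWriter internals (the struct, field, and capacity names are made up for the example):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical producer-side sketch: buffer the message and only notify the
// flush thread once the buffer is more than 30% full. The flush thread's
// one-second timed wait covers the "lone message" case.
struct AsyncLogState {
  std::mutex mutex;
  std::condition_variable flush_cv;
  std::deque<std::string> buffer;
  size_t capacity = 1024;  // stands in for the configured buffer size

  double fraction_used() const { return double(buffer.size()) / capacity; }
};

void enqueue_log_message(AsyncLogState& s, std::string msg) {
  std::lock_guard<std::mutex> guard(s.mutex);
  s.buffer.push_back(std::move(msg));
  if (s.fraction_used() > 0.3) {  // threshold chosen by this patch
    s.flush_cv.notify_one();      // wake the flush thread early
  }
  // Otherwise: no notification; the flush thread wakes within ~1 second anyway.
}
```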
Data
All invocations of the JVM were run with -Xlog:async -Xlog:all=trace:file=logs.txt.

As a measurement of buffer usage I logged the current number of messages and the size of the buffer each time AsyncLogWriter::write() was called. Below I present how often AsyncLogWriter::write() was called with a certain number of messages in the buffer. For example, 1: 11 is read as "AsyncLogWriter::write() was called with 1 message in the buffer 11 times".

Before:
After:
I ran the following command multiple times to gather CPU counter information: sudo perf stat -ddd ./build/linux-x64/jdk/bin/java -Xlog:async -Xlog:all=trace:file=logz.txt. A before-and-after sample is provided below. See context-switches in particular.

Before:
After:
Progress
Issue
Reviewers
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11804/head:pull/11804
$ git checkout pull/11804
Update a local copy of the PR:
$ git checkout pull/11804
$ git pull https://git.openjdk.org/jdk pull/11804/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 11804
View PR using the GUI difftool:
$ git pr show -t 11804
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11804.diff