
JDK-8299415: Saturate the async log buffer more often #11804

Closed

Conversation

@jdksjolen jdksjolen commented Dec 29, 2022

Hi,

This change introduces a heuristic to ensure that the asynchronous logging buffer is saturated more often. Before this change the logger often wrote single messages to the output; now it outputs significantly more data at a time. I also believe that fewer calls to notify() reduce the performance overhead of asynchronous logging. Running perf stat consistently shows fewer context switches with this change, which supports this hypothesis.

The asynchronous logging tests were run and passed with:

_JAVA_OPTIONS='-Xlog:async' make CONF=linux-x64 test TEST='gtest:*Async*'

The choice of a 30% ratio and a 1-second timeout is not based on data. It seemed reasonable that waiting 1 second for your sole log message is acceptable and that 30% is low enough not to cause any message drops.
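As a rough illustration of the heuristic, here is a minimal standalone sketch in plain C++. Only the 30% ratio and the 1-second timeout come from the description above; the class name, the capacity and everything else are invented for illustration and are not the actual HotSpot code.

    // Minimal standalone model of the heuristic: producers only notify the flush
    // thread once buffer usage crosses the 30% threshold, and the flush thread
    // waits with a 1-second timeout so sparse messages are still written out.
    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <string>

    class SketchAsyncLog {
      std::mutex _lock;
      std::condition_variable _cv;
      std::deque<std::string> _buffer;
      static constexpr std::size_t kCapacity    = 1024;  // assumed buffer capacity
      static constexpr double      kNotifyRatio = 0.30;  // notify once 30% full
      bool _data_available = false;

     public:
      // Called by logging threads.
      void enqueue(const std::string& msg) {
        std::lock_guard<std::mutex> g(_lock);
        if (_buffer.size() < kCapacity) {
          _buffer.push_back(msg);  // a real implementation would count drops here
        }
        // Only wake the flush thread once the buffer is sufficiently saturated.
        if (_buffer.size() >= kCapacity * kNotifyRatio) {
          _data_available = true;
          _cv.notify_one();
        }
      }

      // Called by the flush thread in a loop.
      void flush_once() {
        std::deque<std::string> local;
        {
          std::unique_lock<std::mutex> g(_lock);
          // Wait at most 1 second so that a lone message is still flushed.
          _cv.wait_for(g, std::chrono::seconds(1), [&] { return _data_available; });
          local.swap(_buffer);  // grab the whole batch while holding the lock
          _data_available = false;
        }
        for (const auto& msg : local) {  // do the (slow) I/O outside the lock
          std::cout << msg << '\n';
        }
      }
    };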

Data

All invocations of the JVM were run with -Xlog:async -Xlog:all=trace:file=logs.txt

As a measure of buffer usage I logged the current number of messages and the size of the buffer each time AsyncLogWriter::write() was called. Below I present how often AsyncLogWriter::write() was called with a certain number of messages in the buffer. For example, "1 : 11" reads as "AsyncLogWriter::write() was called with 1 message in the buffer 11 times".
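This instrumentation was only for data collection and is not part of the patch; roughly, it amounts to something like the following hypothetical sketch (the histogram map and the print-out are invented here for illustration):

    // Bump a histogram bucket for the current buffer occupancy on every write,
    // then print "message_count : occurrence_count" pairs like the tables below.
    #include <cstddef>
    #include <cstdio>
    #include <map>

    static std::map<std::size_t, std::size_t> g_write_histogram;

    static void record_write(std::size_t messages_in_buffer) {
      ++g_write_histogram[messages_in_buffer];  // one more write at this occupancy
    }

    static void print_histogram() {
      std::size_t calls = 0;
      for (const auto& [count, occurrences] : g_write_histogram) {
        std::printf("%zu : %zu\n", count, occurrences);
        calls += occurrences;
      }
      std::printf("Call count sum: %zu\n", calls);
    }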

Before:

message_count : occurrence_count
1 : 11
2 : 6
3 : 294
4 : 8
5 : 10
6 : 47
7 : 4
8 : 1
9 : 28
10 : 2
11 : 2
12 : 14
15 : 5
18 : 3
19 : 1
20 : 1
21 : 1
22 : 1
26 : 2
27 : 3
30 : 2
33 : 1
36 : 1
37 : 1
50 : 1
62 : 1
63 : 1
75 : 1
207 : 1
231 : 1
1395 : 1
2724 : 1
2789 : 1
3212 : 1
3278 : 1
3503 : 1
4069 : 1
4231 : 1
4305 : 1
4712 : 1
5640 : 1
5864 : 1
5912 : 1
5943 : 1
6096 : 1
6144 : 1
6204 : 1
6220 : 1
6222 : 1
6226 : 1
6268 : 1
6278 : 1
6302 : 1
6309 : 1
6339 : 1
6343 : 1
6369 : 1
6379 : 1
6439 : 1
6472 : 1
6516 : 1
6602 : 1
6714 : 1
6834 : 1
7078 : 1
7475 : 1

Call count sum: 491

After:

message_count : occurrence_count
2002 : 1
2572 : 1
2821 : 1
3279 : 1
5003 : 1
5226 : 1
5749 : 1
5771 : 1
5803 : 1
5979 : 1
6011 : 1
6103 : 1
6140 : 1
6201 : 1
6252 : 1
6265 : 1
6266 : 1
6271 : 1
6288 : 1
6336 : 1
6360 : 1
6373 : 1
6451 : 1
6473 : 1
6515 : 1
6582 : 1
6600 : 1
6774 : 1
7228 : 1

Call count sum: 29

I ran the following command multiple times to gather CPU counter information: sudo perf stat -ddd ./build/linux-x64/jdk/bin/java -Xlog:async -Xlog:all=trace:file=logz.txt. A before and an after sample are provided below. See context-switches in particular.

Before:

                 5 568,53 msec task-clock                #    2,852 CPUs utilized
             1 481      context-switches          #  265,959 /sec
               139      cpu-migrations            #   24,962 /sec
            17 148      page-faults               #    3,079 K/sec
     7 492 468 406      cycles                    #    1,346 GHz                      (39,34%)
    11 820 424 570      instructions              #    1,58  insn per cycle           (46,99%)
     2 497 513 512      branches                  #  448,505 M/sec                    (54,60%)
        23 018 472      branch-misses             #    0,92% of all branches          (62,21%)
    26 465 666 080      slots                     #    4,753 G/sec                    (69,82%)
     9 638 436 928      topdown-retiring          #     33,4% retiring                (69,82%)
    10 041 432 678      topdown-bad-spec          #     34,8% bad speculation         (69,82%)
     8 310 642 872      topdown-fe-bound          #     28,8% frontend bound          (69,82%)
       879 846 428      topdown-be-bound          #      3,0% backend bound           (69,82%)
     3 339 457 604      L1-dcache-loads           #  599,702 M/sec                    (69,38%)
        92 744 182      L1-dcache-load-misses     #    2,78% of all L1-dcache accesses  (69,35%)
         5 575 831      LLC-loads                 #    1,001 M/sec                    (69,32%)
         1 173 449      LLC-load-misses           #   21,05% of all LL-cache accesses  (69,30%)
   <not supported>      L1-icache-loads
       339 072 313      L1-icache-load-misses                                         (30,94%)
     3 319 520 527      dTLB-loads                #  596,121 M/sec                    (31,19%)
           637 308      dTLB-load-misses          #    0,02% of all dTLB cache accesses  (31,38%)
   <not supported>      iTLB-loads
         3 026 652      iTLB-load-misses                                              (31,65%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       1,952442620 seconds time elapsed

       4,277701000 seconds user
       1,303833000 seconds sys

After:

          5 344,15 msec task-clock                #    2,867 CPUs utilized
               994      context-switches          #  185,998 /sec
               101      cpu-migrations            #   18,899 /sec
            16 902      page-faults               #    3,163 K/sec
     7 220 944 481      cycles                    #    1,351 GHz                      (39,36%)
    11 277 945 316      instructions              #    1,56  insn per cycle           (47,12%)
     2 371 729 681      branches                  #  443,799 M/sec                    (54,73%)
        22 683 816      branch-misses             #    0,96% of all branches          (62,18%)
    24 913 623 884      slots                     #    4,662 G/sec                    (69,76%)
     9 354 630 464      topdown-retiring          #     34,6% retiring                (69,76%)
     9 339 435 252      topdown-bad-spec          #     34,5% bad speculation         (69,76%)
     7 563 625 485      topdown-fe-bound          #     27,9% frontend bound          (69,76%)
       815 220 376      topdown-be-bound          #      3,0% backend bound           (69,76%)
     3 209 639 698      L1-dcache-loads           #  600,589 M/sec                    (69,62%)
        89 008 704      L1-dcache-load-misses     #    2,77% of all L1-dcache accesses  (69,55%)
         5 369 924      LLC-loads                 #    1,005 M/sec                    (69,16%)
         1 093 971      LLC-load-misses           #   20,37% of all LL-cache accesses  (69,32%)
   <not supported>      L1-icache-loads
       323 572 438      L1-icache-load-misses                                         (30,75%)
     3 096 980 739      dTLB-loads                #  579,509 M/sec                    (31,33%)
           616 949      dTLB-load-misses          #    0,02% of all dTLB cache accesses  (31,46%)
   <not supported>      iTLB-loads
         2 741 468      iTLB-load-misses                                              (31,51%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       1,864260247 seconds time elapsed

       4,129709000 seconds user
       1,225694000 seconds sys

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8299415: Saturate the async log buffer more often

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/11804/head:pull/11804
$ git checkout pull/11804

Update a local copy of the PR:
$ git checkout pull/11804
$ git pull https://git.openjdk.org/jdk pull/11804/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 11804

View PR using the GUI difftool:
$ git pr show -t 11804

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/11804.diff

bridgekeeper bot commented Dec 29, 2022

👋 Welcome back jsjolen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr (Pull request is ready for review) label Dec 29, 2022

openjdk bot commented Dec 29, 2022

@jdksjolen The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime (hotspot-runtime-dev@openjdk.org) label Dec 29, 2022

mlbridge bot commented Dec 29, 2022

Webrevs

@jdksjolen
Contributor Author

I also believe that fewer calls to notify() reduce the performance overhead of asynchronous logging. Running perf stat consistently shows fewer context switches with this change, which supports this hypothesis.

This is probably fairly weak evidence for it, and a day later I'm not so sure that it is true :-). Still, the change is good imho!

@navyxliu
Member

I think your change to fraction_used() is reasonable. We don't need to notify the flush thread at a very early stage. As long as 0.3 doesn't cause overflow for -Xlog:all=trace, it's good for me. Your histogram and context-switch counts demonstrate its effectiveness.

@@ -174,7 +176,7 @@ void AsyncLogWriter::run() {
     AsyncLogLocker locker;

     while (!_data_available) {
-      _lock.wait(0/* no timeout */);
+      _lock.wait(1000);
Member


Does this increase context switches? In particular, HotSpot may not enable unified logging at all. Does the flush thread then wake up every second for nothing?

It seemed reasonable that waiting 1 second for your sole log message is acceptable

Could you elaborate on this explanation?

I understand that we somehow need to flush the queue if the log stream has ended and fraction_used() is still lower than 0.3. I think we can remember a historic fraction_used. As long as this value increases, it suggests that logs are still trickling in, so we don't need to worry about it.

Using this algorithm, I think we can use an even bigger threshold. The reasonable threshold depends on the speed of I/O, which in reality is not known until runtime. I think we can make it adaptive in the future: if we detect overflow, we lower the threshold down to 0.0.

@jdksjolen
Contributor Author

Does this increase context switches? In particular, HotSpot may not enable unified logging at all. Does the flush thread then wake up every second for nothing?

Yes, it increases context switching. If HotSpot hasn't enabled -Xlog:async then this code won't run.

I understand that we somehow need to flush the queue if the log stream has ended and fraction_used() is still lower than 0.3. I think we can remember a historic fraction_used. As long as this value increases, it suggests that logs are still trickling in, so we don't need to worry about it.

The reasoning for the timeout was twofold:

  1. As you said: if the log stream has ended and fraction_used() is still lower than 0.3, we should still log in a timely manner.
  2. Flushing logs in a timely manner makes asynchronous logging suitable for interactive use.

Let's say that a historic fraction is used; where would the comparison be placed? The only place I can think of is the enqueue method, but there we already know that messages are being supplied. We could compute a rate of change, and if that rate is too low then we flush each time. Both of these solutions depend on an infinite stream of logs (however far apart the messages are) if the timeout is not used. Perhaps this should be combined with a much longer timeout, such as 30 seconds? I think spuriously waking up every 30 seconds is acceptable.
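A rough sketch of that rate-of-change idea, with all names and constants invented for illustration rather than proposed as-is: the flush thread remembers the usage it saw on its previous wakeup and flushes either when the buffer is sufficiently full or when usage has effectively stopped growing, with a long timeout bounding any wait.

    struct FlushPolicy {
      double _last_fraction = 0.0;
      static constexpr double kNotifyThreshold = 0.30;      // flush eagerly above this
      static constexpr double kMinGrowth       = 0.01;      // below this, logs have stopped trickling in
      static constexpr long   kTimeoutMs       = 30 * 1000; // e.g. the 30-second wakeup mentioned above

      // Called by the flush thread each time it wakes up (notification or timeout).
      bool should_flush(double fraction_used) {
        const double growth = fraction_used - _last_fraction;
        _last_fraction = fraction_used;
        return fraction_used >= kNotifyThreshold || growth < kMinGrowth;
      }

      long timeout_ms() const { return kTimeoutMs; }
    };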

Using this algorithm, I think we can use an even bigger threshold. The reasonable threshold depends on the speed of I/O, which in reality is not known until runtime. I think we can make it adaptive in the future: if we detect overflow, we lower the threshold down to 0.0.

It's reasonable for this RFE to lower the threshold directly down to 0.0 if overflow is detected; I agree that it's a good idea for a future RFE to implement something more dynamic.
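As a hypothetical illustration of that fallback (the type and its names are invented here, not taken from the patch):

    // Once an overflow is ever detected, the notify threshold drops to 0.0,
    // i.e. the writer becomes fully eager again and notifies on every enqueue.
    struct NotifyThreshold {
      double _ratio = 0.30;                  // start with the 30% heuristic
      void on_overflow()  { _ratio = 0.0; }  // a dropped message disables the heuristic
      bool should_notify(double fraction_used) const {
        return fraction_used >= _ratio;
      }
    };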


@navyxliu navyxliu left a comment


LGTM. I am not a Reviewer, so we still need other reviewers to approve it.


navyxliu commented Jan 2, 2023

Hi, @jdksjolen,

Yes, it increases context switching. If HotSpot hasn't enabled -Xlog:async then this code won't run.

Fair enough. One corner case is that users declare '-Xlog:async' but don't provide any '-Xlog:' selections; they may postpone selecting logs until runtime using dcmd. We can ignore that because the compromise is acceptable.

Both of these solutions depend on an infinite stream of logs (however far apart the messages are) if the timeout is not used. Perhaps this should be combined with a much longer timeout, such as 30 seconds? I think spuriously waking up every 30 seconds is acceptable.

I agree. No matter what, we can't put the flush thread to sleep without a timeout argument; we can't hold messages unflushed until termination. The fixed 1-second timeout isn't perfect, but I think it's reasonable. Your patch looks good to me.

@dholmes-ora
Member

There is obviously a spectrum of changes possible based on how big a buffer is used and how often it is flushed - can the user already control these two things? If yes then I don't see a justification to change the current behaviour. If no then surely that is what is needed?


navyxliu commented Jan 3, 2023

Hi, @dholmes-ora,

The current implementation lacks a parameter for "how often it is flushed". Log sites notify the flush thread for every single message, which is the most conservative approach. The general idea is to flush as many messages as possible each time; this amortizes the cost of synchronization and external I/O. I think this patch works toward that goal.

@dholmes-ora
Member

The premise of this patch was "the buffer is being under-utilised", so it delays flushing so that the buffer gets more full. This seems an arbitrary selection - we could have made the buffer smaller instead. I'm not seeing a rationale for the current change that justifies changing the behaviour for all users in a way that they cannot compensate for. If buffer utilisation is a metric of interest (and I'm not convinced it is) then we should be providing a way to control the buffer size (already there) and the flush rate.

@jdksjolen
Contributor Author

Hi David, I was surprised to hear that your intuition regarding this went against my own. That caused me to redo the measurements, and I realised I had made a big mistake: I ran the tests without optimisations (slowdebug). I then re-ran the measurements with an optimised build (no perf stat, as this changes the performance drastically).

The results of those measurements showed that all=trace does cause messages to be dropped with my patch. all=debug shows behaviour similar to my initial testing: my patch fully saturates the buffer while the original code doesn't. I suspect that there is heavy lock contention on the AsyncLogLocker in the all=trace case, which makes it very difficult for the writer to gain access in time, so the buffer fills up and messages are dropped.

With that in mind, maybe what should be looked into isn't this patch, but rather replacing the buffer with a lock-free circular one (or something that fits even better), if we want better performance.
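To make the suggestion concrete, here is a minimal single-producer/single-consumer lock-free ring buffer sketch; the real logging path has many producer threads, so an actual replacement would need a multi-producer design, and nothing here is meant as the concrete HotSpot implementation.

    // Illustrative SPSC lock-free ring; a real async-log replacement would need MPSC.
    #include <atomic>
    #include <cstddef>
    #include <optional>
    #include <string>
    #include <utility>

    template <std::size_t N>
    class SpscRing {
      std::string _slots[N];
      std::atomic<std::size_t> _head{0};  // next slot to read  (consumer-owned)
      std::atomic<std::size_t> _tail{0};  // next slot to write (producer-owned)

     public:
      // Producer side: returns false (message dropped) when the ring is full.
      bool try_push(std::string msg) {
        const std::size_t tail = _tail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == _head.load(std::memory_order_acquire)) {
          return false;  // full
        }
        _slots[tail] = std::move(msg);
        _tail.store(next, std::memory_order_release);
        return true;
      }

      // Consumer side: the flush thread drains the ring without taking a lock.
      std::optional<std::string> try_pop() {
        const std::size_t head = _head.load(std::memory_order_relaxed);
        if (head == _tail.load(std::memory_order_acquire)) {
          return std::nullopt;  // empty
        }
        std::string msg = std::move(_slots[head]);
        _head.store((head + 1) % N, std::memory_order_release);
        return msg;
      }
    };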

@navyxliu, maybe we should avoid merging this patch, as it adds complexity while there are other measures that could give much bigger gains?

@navyxliu
Member

Hi, @jdksjolen,
It's fine to drop it. We should avoid premature optimizations. I think the number of context switches is a good metric; "buffer utilisation" may not be a primary metric.

I still have a task in my backlog: I plan to compress Message when consecutive messages are from the same LogOutput, which saves a pointer (8 bytes) per message.

Yes, a lock-free circular buffer is better, but I am not sure we have a scenario like the Linux kernel's printk, which is always highly contended among kernel threads. If we ignore extreme cases such as all=trace, I have only observed heavy lock contention among concurrent GC threads with 'gc*=debug'.

thanks,
--lx

@jdksjolen jdksjolen closed this Jan 13, 2023