Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8283199: Linux os::cpu_microcode_revision() stalls cold startup #7825

Closed

Conversation

shipilev
Copy link
Member

@shipilev shipilev commented Mar 15, 2022

Noticed this when staring at timeline profiles for JVM startup. If you run any small startup workload on Linux, there is a significant time gap where JVM does nothing. I pinpointed that to fopen of /proc/cpuinfo due to JDK-8249672.

This does not reproduce if you run startup workloads continuously, as it looks as if /proc/cpuinfo is cached for about a second, and stalls reads after that, I suspect for CPU info updates, like frequency. This reproduces on at least two of my systems running Linux kernels 5.4 and 5.15.

Observe:

$ for I in `seq 1 5`; do sleep 0.99; time cat /proc/cpuinfo > /dev/null; done
real	0m0.022s
real	0m0.003s
real	0m0.004s
real	0m0.004s
real	0m0.004s

$ for I in `seq 1 5`; do sleep 1.00; time cat /proc/cpuinfo > /dev/null; done
real	0m0.022s
real	0m0.022s
real	0m0.022s
real	0m0.022s
real	0m0.022s

It directly impacts JVM startup:

$ for I in `seq 1 3`; do sleep 1; time build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -version > /dev/null 2>&1 ; done

real	0m0.046s
real	0m0.045s
real	0m0.046s

Without the sleep:

$ for I in `seq 1 3`; do time build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -version > /dev/null 2>&1 ; done

real	0m0.026s
real	0m0.026s
real	0m0.023s

There is another way to do this: read /sys/devices/system/cpu/cpu0/microcode/version, this is what this patch does. With sleep 1:

$ for I in `seq 1 3`; do sleep 1; time build/linux-x86_64-server-release/images/jdk/bin/java -Xms128m -Xmx128m -version > /dev/null 2>&1 ; done

real	0m0.025s
real	0m0.024s
real	0m0.024s

Which means it improves startup time from ~45ms to ~25ms, or about 1.8x!

os::cpu_microcode_revision() is currently used to generate the VM features string. It raises a bigger question if VM features string should be generated on startup, but it seems to take negligible time otherwise, see JDK-8283200. Given JDK-8249672 had been backported to update releases, let's do a pointed fix for this regression first.

Additional testing:

  • Linux x86_64 fastdebug tier1
  • Eyeballing reported microcode number with/without the fix -- same thing reported

Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8283199: Linux os::cpu_microcode_revision() stalls cold startup

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk pull/7825/head:pull/7825
$ git checkout pull/7825

Update a local copy of the PR:
$ git checkout pull/7825
$ git pull https://git.openjdk.java.net/jdk pull/7825/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 7825

View PR using the GUI difftool:
$ git pr show -t 7825

Using diff file

Download this PR as a diff file:
https://git.openjdk.java.net/jdk/pull/7825.diff

@bridgekeeper
Copy link

bridgekeeper bot commented Mar 15, 2022

👋 Welcome back shade! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Mar 15, 2022
@openjdk
Copy link

openjdk bot commented Mar 15, 2022

@shipilev The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-runtime hotspot-runtime-dev@openjdk.org label Mar 15, 2022
@mlbridge
Copy link

mlbridge bot commented Mar 15, 2022

Webrevs

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What versions of Linux have /sys/devices/system/cpu/cpu0/microcode/version? And is it present in virtualized environments? What cost is there for systems that always have to take the second path?

Thanks,
David

break;
sscanf(data, "%x", &result);
}
fclose(fp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff would be much simpler if you added a return here and so avoided the need for the if statement below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See new commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that change didn't really help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't see what you mean then, so I am open for clear suggestion how would you write it.

@shipilev
Copy link
Member Author

shipilev commented Mar 21, 2022

What versions of Linux have /sys/devices/system/cpu/cpu0/microcode/version?

I looked through Linux kernel sources, and I believe it was added by torvalds/linux@9a4b9ef back in 2006, and landed in 2.6.19. So even a relatively ancient RHEL 6 already has it.

And is it present in virtualized environments?

One of my x86_64 QEMU nodes does not have it, apparently. There is still /proc/cpuinfo with microcode: field there. I think another trouble in virtualized environment would be not having the CPU0, but rather some other CPU id.

What cost is there for systems that always have to take the second path?

So, emulating the missing sysfs file...

Cold start:

# Baseline
real	0m0.043s
real	0m0.042s
real	0m0.042s
real	0m0.043s
real	0m0.042s

# Real CPU id in sysfs (this patch)
real	0m0.024s
real	0m0.022s
real	0m0.023s
real	0m0.023s
real	0m0.023s

# Non-existent CPU id in sysfs
real	0m0.045s
real	0m0.043s
real	0m0.043s
real	0m0.043s
real	0m0.043s

Hot start:

# Baseline
real	0m0.046s
real	0m0.024s
real	0m0.024s
real	0m0.024s
real	0m0.026s

# Real CPU id in sysfs (this patch)
real	0m0.025s
real	0m0.023s
real	0m0.023s
real	0m0.023s
real	0m0.025s

# Non-existent CPU id in sysfs
real	0m0.045s
real	0m0.025s
real	0m0.026s
real	0m0.025s
real	0m0.025s

So I'd say we pay about 1 ms penalty if sysfs file does not exist. Instrumenting the sysfs read path shows the entire access to missing sysfs entry takes <100 us.

I would chalk it up as "pretend this is the actual performance cost for JDK-8249672 work", push this follow-up, and actually look at moving all this mess out of the startup sequence with JDK-8283200.

Copy link
Member

@cl4es cl4es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! Most startup tests of ours are continuous so I can't say we detected this regression.

src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp Outdated Show resolved Hide resolved
@openjdk
Copy link

openjdk bot commented Mar 21, 2022

@shipilev This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8283199: Linux os::cpu_microcode_revision() stalls cold startup

Reviewed-by: dholmes, redestad, stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 58 new commits pushed to the master branch:

  • 8cc1235: 8282952: Thread::exit should be immune to Thread.stop
  • 33eb89d: 8283457: [macos] libpng build failures with Xcode13.3
  • f7d21c3: 8283480: Make AbstractStringBuilder sealed
  • d29c7e7: 8282590: C2: assert(addp->is_AddP() && addp->outcnt() > 0) failed: Don't process dead nodes
  • 557ff4b: 8282625: Formatter caches Locale/DecimalFormatSymbols poorly
  • fabde3b: 8283451: C2: assert(_base == Long) failed: Not a Long
  • c0f984e: 8283456: Make CompiledICHolder::live_count/live_not_claimed_count debug only
  • 85628a8: 8282592: C2: assert(false) failed: graph should be schedulable
  • a6fd0b2: 8283087: Create a test or JDK-4715503
  • 37fc77e: 8283092: JMX subclass permission check redundant with strong encapsulation
  • ... and 48 more: https://git.openjdk.java.net/jdk/compare/3e393047e12147a81e2899784b943923fc34da8e...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Mar 21, 2022
@tstuefe
Copy link
Member

tstuefe commented Mar 21, 2022

Interesting find.

Do you know why access to /proc is slower than to /sys/devices? Is it all /proc accesses or just /proc/cpuinfo?

My first thought was that since we use /proc for many things, we'd probably end up paying for it anyway.

Just idle thoughts, I'm fine with the patch in general (had no close look yet).

@dholmes-ora
Copy link
Member

I looked through Linux kernel sources, and I believe it was added by torvalds/linux@9a4b9ef back in 2006, and landed in 2.6.19. So even a relatively ancient RHEL 6 already has it.

My "Oracle Linux Server release 7.9" doesn't have it. But I'm not sure if I'm virtualized or not ... I suspect yes.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block structure still makes the diff a lot more complex than it needs to be given you are just inserting a new block of code. But the functional change seems okay.

src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp Outdated Show resolved Hide resolved
src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp Outdated Show resolved Hide resolved
@shipilev
Copy link
Member Author

shipilev commented Mar 21, 2022

Do you know why access to /proc is slower than to /sys/devices? Is it all /proc accesses or just /proc/cpuinfo?

I suspect that's because we generate the entirety of /proc/cpuinfo on read, which includes doing lots of string printing, getting the CPU frequency (which, AFAICS, includes IPI). I have not explored this part deeply, though.

$ perf record -F max -g md5sum /proc/cpuinfo
$ perf report
...
   - do_syscall_64                             
      - 61.74% __x64_sys_read                  
           ksys_read                           
           vfs_read                            
           __vfs_read                          
           proc_reg_read                       
         - seq_read                            
            - 60.53% show_cpuinfo              
               - 41.30% seq_printf             
                  - 36.37% seq_vprintf         
                     + 19.35% vsnprintf        
                       12.11% memcpy           
                       1.24% format_decode     
                    1.23% vsnprintf            
               - 9.71% aperfmperf_get_khz      
                  - aperfmperf_snapshot_cpu    
                    smp_call_function_single   
                 1.23% memcpy                  
              1.21% page_fault                 
      - 11.90% __x64_sys_openat                
           do_sys_open                         
           do_filp_open                        
           path_openat                         
           do_last                             
           vfs_open                            
           do_dentry_open                      
           proc_reg_open                       
           cpuinfo_open                        
         - arch_freq_prepare_all               
            - 10.81% aperfmperf_snapshot_cpu   
               - smp_call_function_single
                    llist_add_batch
            + 1.09% msleep
      + 2.51% __x64_sys_execve

@shipilev
Copy link
Member Author

Do you know why access to /proc is slower than to /sys/devices? Is it all /proc accesses or just /proc/cpuinfo?

I suspect that's because we generate the entirety of /proc/cpuinfo on read, which includes doing lots of string printing, getting the CPU frequency (which, AFAICS, includes IPI). I have not explored this part deeply, though.

I further suspect this is where the 1 second "cache" comes from: getting frequency from APERF snapshot: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/aperfmperf.c#L33

@tstuefe
Copy link
Member

tstuefe commented Mar 21, 2022

Do you know why access to /proc is slower than to /sys/devices? Is it all /proc accesses or just /proc/cpuinfo?

I suspect that's because we generate the entirety of /proc/cpuinfo on read, which includes doing lots of string printing, getting the CPU frequency (which, AFAICS, includes IPI). I have not explored this part deeply, though.

Makes sense. And it does it for every core.

Copy link
Member

@tstuefe tstuefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thanks.

@shipilev
Copy link
Member Author

Thanks for reviews!

/integrate

@openjdk
Copy link

openjdk bot commented Mar 23, 2022

Going to push as commit 1443f6b.
Since your change was applied there have been 61 commits pushed to the master branch:

  • 82e1a1c: 8283257: x86: Clean up invocation/branch counter updates code
  • b035fda: 8283497: [windows] print TMP and TEMP in hs_err and VM.info
  • 6ea996c: 8282422: JTable.print() failed with UnsupportedCharsetException on AIX ko_KR locale
  • 8cc1235: 8282952: Thread::exit should be immune to Thread.stop
  • 33eb89d: 8283457: [macos] libpng build failures with Xcode13.3
  • f7d21c3: 8283480: Make AbstractStringBuilder sealed
  • d29c7e7: 8282590: C2: assert(addp->is_AddP() && addp->outcnt() > 0) failed: Don't process dead nodes
  • 557ff4b: 8282625: Formatter caches Locale/DecimalFormatSymbols poorly
  • fabde3b: 8283451: C2: assert(_base == Long) failed: Not a Long
  • c0f984e: 8283456: Make CompiledICHolder::live_count/live_not_claimed_count debug only
  • ... and 51 more: https://git.openjdk.java.net/jdk/compare/3e393047e12147a81e2899784b943923fc34da8e...master

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Mar 23, 2022
@openjdk openjdk bot closed this Mar 23, 2022
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Mar 23, 2022
@openjdk
Copy link

openjdk bot commented Mar 23, 2022

@shipilev Pushed as commit 1443f6b.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@shipilev shipilev deleted the JDK-8283199-cpu-microcode-revision branch March 23, 2022 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated
4 participants