Regression: intel-ucode 20191113 causes lockup on reboot #21

Open
ahesford opened this issue Nov 14, 2019 · 40 comments

Comments

@ahesford ahesford commented Nov 14, 2019

On an Arch Linux installation, intel-ucode 20191113 causes a lockup when rebooting a Dell Precision 5820 workstation with an Intel Xeon W-2145 CPU. The system cold boots fine, but once the system is running, a reboot will cause a lockup when the kernel is reloaded. This is a regression from intel-ucode 20190918 and affects more than one bootloader and at least the linux, linux-zen and linux-lts kernels distributed by Arch. Because these three kernels are all affected and switching to the earlier intel-ucode package resolves the issue, I believe this is an upstream issue rather than a problem with the Arch package.

Additional info:

Affected version: 20191113
Last working version: 20190918

Bootloaders affected:

  • systemd-boot (from systemd 243, Arch package systemd-243.78-2)
  • rEFInd (version 0.11.3, Arch package refind-efi-0.11.3-1)

Note: rEFInd was not configured to apply the microcode patch at boot. Instead, the system was cold-booted from systemd-boot to apply the microcode patch, the boot manager was replaced with rEFInd, and the system was warm-rebooted using rEFInd. Thus, the lockup does not appear to be caused by the act of loading the microcode during the reboot; instead, the CPU appears to lock up on a warm reset once the microcode has been applied at least once.

Kernels affected:

  • linux (5.3.11, Arch package linux-5.3.11.1-1)
  • linux longterm (4.19.84, Arch package linux-lts-4.19.84-1)
  • linux zen (5.3.11, Arch package linux-zen-5.3.11.1-1)

For systemd-boot, the loader entry is:
title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options root=UUID=[masked] rw
options consoleblank=600
options audit=0

For rEFInd, the loader entry was created automatically using refind-install and applied the same kernel arguments (root, consoleblank and audit) as the systemd loader.

Steps to reproduce:

  1. Install Arch Linux and any of the kernels listed above (other kernels may be similarly affected).
  2. Install package intel-ucode 20191113-1.
  3. Configure systemd-boot to boot with a loader entry like that above, making sure to load the microcode during boot.
  4. Cold boot the system; everything should boot as expected.
  5. Invoke "shutdown -r now".
  6. After systemd-boot selects the kernel to boot, the system should hang on the "SHA256 validated" message.
  7. Forcibly power down the system.
  8. [In my case, attempting to turn on the system at this point causes the fans to spin, then the system to immediately shut itself down; powering on a second time brings up the system as expected.]
  9. Replace intel-ucode 20191113-1 with version 20190918-1 and confirm that the system boots and reboots as expected.

@hmh hmh commented Nov 14, 2019

Can you tell us the output of "cat /proc/cpuinfo" ?

Contributor

@esyr-rh esyr-rh commented Nov 14, 2019

And grep -r . /sys/devices/system/cpu/vulnerabilities as well, please?

Author

@ahesford ahesford commented Nov 14, 2019

Below is the content of /proc/cpuinfo for the first virtual CPU. This is an eight-core model with hyperthreading, so this block repeats 15 more times. The only differences are the frequencies (which obviously jump around), the core ID (which matches the processor index), and the apicid and initial apicid fields (which have the same value for each processor: twice the processor index for indices 0 through 7, and twice the processor index plus one for indices 8 through 15).

If you want all of the other blocks, please let me know.

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz
stepping	: 4
microcode	: 0x2000064
cpu MHz		: 1450.713
cache size	: 11264 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 7402.02
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
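
Editor's note: the "cpu family", "model" and "stepping" fields above are what tie this CPU to the 0x50654 signature and the 06-55-04 microcode file discussed later in the thread. A minimal sketch of that arithmetic, assuming the standard CPUID leaf 1 EAX layout; the function names are hypothetical and not part of any tool mentioned here:

```python
def cpuid_signature(family: int, model: int, stepping: int) -> int:
    """Rebuild the CPUID.1:EAX signature from /proc/cpuinfo's decimal fields."""
    # Families above 15 are encoded as base family 0xF plus an extended family.
    if family > 0xF:
        base_family, ext_family = 0xF, family - 0xF
    else:
        base_family, ext_family = family, 0
    # The displayed model is (extended model << 4) | base model.
    base_model, ext_model = model & 0xF, (model >> 4) & 0xF
    return (
        (ext_family << 20)
        | (ext_model << 16)
        | (base_family << 8)
        | (base_model << 4)
        | (stepping & 0xF)
    )


def ucode_filename(family: int, model: int, stepping: int) -> str:
    """Name of the per-CPU file under intel-ucode/ (family-model-stepping in hex)."""
    return f"{family:02x}-{model:02x}-{stepping:02x}"


# Values from the cpuinfo block above: family 6, model 85, stepping 4.
print(hex(cpuid_signature(6, 85, 4)))  # 0x50654
print(ucode_filename(6, 85, 4))        # 06-55-04
```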

@ahesford ahesford commented Nov 14, 2019

Below are the contents of /sys/devices/system/cpu/vulnerabilities.

/sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
/sys/devices/system/cpu/vulnerabilities/itlb_multihit:KVM: Mitigation: Split huge pages
/sys/devices/system/cpu/vulnerabilities/mds:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: usercopy/swapgs barriers and __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI

@esyr-rh esyr-rh commented Nov 15, 2019

Hm, may I ask you to add the "debug" parameter to the kernel's command line?

@ahesford ahesford commented Nov 15, 2019

Adding "debug" does not cause a change of behavior after a warm reboot. The system locks up after the systemd-boot "SHA256 validated" message without any additional information. The dmesg output after the first (cold) boot with debug enabled is attached.

How else may I help to isolate this issue?

dmesg.debug.log

@sclarkson sclarkson commented Nov 15, 2019

I'm experiencing this as well.

System information

Mobo: ASUS WS X299 SAGE
CPU: 9920x and 9820x
OS: Ubuntu 18.04.3

I've tried Ubuntu's kernel 5.0.0-36-generic, as well as the mainline 5.3.11. Both exhibit the problem. I've also tried the latest BIOS from ASUS.

Confirmed that upgrading to 20191112 from 20190918 caused the issue.

System hangs when GRUB tries to load the kernel after either running reboot on the command line or pressing the reset button on the motherboard.

@esyr-rh esyr-rh commented Nov 15, 2019

It seems that a new release has just been made available [1]; may I ask you to try it out?

[1] https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/releases/tag/microcode-20191115

@hagar-dunor hagar-dunor commented Nov 17, 2019

Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.

@esyr-rh esyr-rh commented Nov 18, 2019

Right, there are no updates for 06-55-04 in microcode-20191115, so there's no point in testing 20191115; my apologies.

@hmh hmh commented Nov 18, 2019

Looks server-die related (all reports are from HEDT parts or Xeon parts) ?

Kabylake desktop and mobile (0x806e9, 0x906e9) are not showing the reboot issue here.
CoffeeLake mobile (0x906ea) did not show any reboot issues, either.

@esyr-rh esyr-rh commented Nov 18, 2019

> Looks server-die related (all reports are from HEDT parts or Xeon parts) ?

Both reports are against CPUID 0x50654 parts, yes.

> Skylake mobile and desktop (0x806e9, 0x906e9) are not showing the reboot issue here.

FYI, Skylake mobile/desktop have CPUID 0x406e3 and 0x506e3, respectively.

@esyr-rh esyr-rh commented Nov 18, 2019

> Gentoo user, same issue reported here on a Core i9-7920x. 20191115 doesn't seem to do any better.

May I ask you to try out the microcode-20190918 release, with 06-55-04 microcode revision 0x2000064? Thank you.

@hmh hmh commented Nov 18, 2019

esyr-rh: oops, sorry about that! Wrong names but correct CPUIDs: tests were done on 0x806e9, 0x906e9, 0x906ea. None show the reboot issue.
(Edited the incorrect post; it now has the proper processor names.)

@ahesford ahesford commented Nov 18, 2019

I can confirm that the issue persists on a Xeon W-2145 with 20191115, while the issue has never affected a Core i7-8705G in a Dell XPS 15 9575.

@esyr-rh esyr-rh commented Nov 18, 2019

> I can confirm that the issue persists on a Xeon W-2145 with 20191115,

I apologise for the pointless question regarding 20191115; checking against 20190918 makes much more sense (as that's the release where the previous 0x2000064 revision of the 06-55-04 microcode is provided). May I kindly ask you to try that?

@ahesford ahesford commented Nov 18, 2019

> > I can confirm that the issue persists on a Xeon W-2145 with 20191115,
>
> I apologise for the pointless question regarding 20191115; checking against 20190918 makes much more sense (as that's the release where the previous 0x2000064 revision of the 06-55-04 microcode is provided). May I kindly ask you to try that?

I'm not clear about what you'd like me to try; the 20190918 release works perfectly with my CPU.

@esyr-rh esyr-rh commented Nov 18, 2019

Sorry, I've forgotten about the fact that you have already provided this information; again, my apologies.

@ahesford ahesford commented Nov 18, 2019

> Sorry, I've forgotten about the fact that you have already provided this information; again, my apologies.

No worries. Please let me know if I can try anything else to illuminate the problem.

@ahesford ahesford commented Nov 18, 2019

I may have a resolution: the Arch package is built using

iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/

The manpage for iucode_tool indicates that early firmware images must be 16-byte aligned and that the --write-earlyfw option enforces this. Altering the Arch PKGBUILD to use --write-earlyfw instead of -w changes the size of the microcode image and seems to fix the issue. I've warm-rebooted four times with the modified image and see no issues.

It looks to me like Gentoo is using --write-firmware instead of --write-earlyfw as well. Maybe this explains the problem seen by @hagar-dunor.

@hagar-dunor hagar-dunor commented Nov 18, 2019

ahesford: I'm following this wiki and therefore using --write-earlyfw

to be more specific: this is exactly the command I type
iucode_tool -S --write-earlyfw=/boot/early_ucode.cpio /lib/firmware/intel-ucode/*
and then update the grub config file which picks up the microcode

Are you certain that you actually load the microcode with the modified Arch PKGBUILD? (It must be the first line in your "dmesg".)

esyr-rh: revision 0x2000064 doesn't show the problem (which I extracted using the same command above)

@esyr-rh esyr-rh commented Nov 18, 2019

> I may have a resolution: the Arch package is built using
>
> iucode_tool -w kernel/x86/microcode/GenuineIntel.bin intel-ucode{,-with-caveats}/

What if only 06-55-04 microcode is added, like iucode_tool --write-earlyfw kernel/x86/microcode/GenuineIntel.bin intel-ucode/06-55-04 (or -w, for that matter)?

@ahesford ahesford commented Nov 18, 2019

My mistake; I overlooked that --write-earlyfw creates the CPIO archive directly, but -w creates a binary image. Changing the Arch PKGBUILD as I suggest creates an invalid initrd that the microcode update driver ignores.

@esyr-rh, a custom early initrd that contains only 06-55-04 still exhibits the warm-reboot issue.
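
Editor's note: the mix-up above (a raw binary image where a CPIO archive was expected) can be caught by listing the members of the early initrd before rebooting. The following is a minimal sketch, assuming the image is an uncompressed "newc"-format CPIO archive and that the kernel looks for the member path kernel/x86/microcode/GenuineIntel.bin quoted earlier in this thread. The function name is hypothetical.

```python
from typing import List, Tuple

NEWC_MAGIC = b"070701"
NEWC_HEADER_LEN = 110  # 6-byte magic + 13 fields of 8 hex digits each


def _align4(offset: int) -> int:
    """Round an offset up to the next multiple of 4, as newc requires."""
    return (offset + 3) & ~3


def list_newc(blob: bytes) -> List[Tuple[str, int, int]]:
    """Return (name, data_offset, data_size) for each member of a newc archive."""
    entries = []
    pos = 0
    while pos + NEWC_HEADER_LEN <= len(blob):
        header = blob[pos:pos + NEWC_HEADER_LEN]
        if header[:6] != NEWC_MAGIC:
            raise ValueError(f"not a newc cpio header at offset {pos}")
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_start = pos + NEWC_HEADER_LEN
        # The stored name is NUL-terminated and may carry extra NUL padding.
        raw_name = blob[name_start:name_start + namesize]
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        data_start = _align4(name_start + namesize)
        if name == "TRAILER!!!":
            break
        entries.append((name, data_start, filesize))
        pos = _align4(data_start + filesize)
    return entries
```

A raw image written with -w does not start with the newc magic, so list_newc raises ValueError on it, while a proper early initrd should list the GenuineIntel.bin member. The returned data offset can also be checked against the 16-byte alignment discussed in this thread.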

@hmh hmh commented Nov 18, 2019

The 16-byte alignment is an old requirement, it might have even been lifted from the IA32 manual nowadays. I need to hunt it down one of these days, and update iucode_tool accordingly...

@whpenner whpenner commented Nov 20, 2019

The 16-byte alignment has been and continues to be a requirement. This can be found in the Intel(R) 64 and IA-32 Architectures Software Developer's Manual, vol 3A, page 9-34 (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf). "Note that the microcode update must be aligned on a 16-byte boundary and the size of the microcode update must be 1-KByte granular."
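
Editor's note: the two constraints quoted from the manual amount to simple modular arithmetic. A minimal sketch, with a hypothetical function name, using the update sizes that appear elsewhere in this thread (34816 bytes for revision 0x2000065 and 33792 bytes for revision 0x2000064):

```python
from typing import List


def check_update_layout(offset: int, size: int) -> List[str]:
    """Check an update's start offset and size against the two quoted rules."""
    problems = []
    if offset % 16 != 0:
        problems.append(f"offset {offset} is not on a 16-byte boundary")
    if size % 1024 != 0:
        problems.append(f"size {size} is not 1-KByte granular")
    return problems


# Both sizes reported in this thread are whole multiples of 1024.
print(check_update_layout(0, 34816))  # []
print(check_update_layout(0, 33792))  # []
```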

@whpenner whpenner commented Nov 20, 2019

So, did I read this thread correctly and it is not a microcode issue?

@ahesford ahesford commented Nov 20, 2019

> So, did I read this thread correctly and it is not a microcode issue?

I believe this is a microcode issue. My first attempt with the --write-earlyfw option to iucode_tool was incorrect, producing an invalid image that was ignored by the kernel loader. Subsequent attempts with proper use of --write-earlyfw produce images that are properly loaded but that continue to show the warm-reboot lockup issue.

@hmh hmh commented Nov 20, 2019

@whpenner, thanks. I will keep the remark about it in the iucode-tool manual, then.

But current Intel processors do not care. It would be really nice to know if future processors will care, though.

For the record, they don't care about the 1KiB size either, but it is a good idea to ensure that padding is there just in case the dang thing will read-past-the-end and cause a fault.

@wendigo wendigo commented Nov 26, 2019

Same here. Microcode 3.20191115 on an i7-7820X CPU. The system boots fine, but after a warm reboot it gets stuck.

@xwjabc xwjabc commented Nov 30, 2019

Same here. In dmesg: microcode: microcode updated early to revision 0x2000065, date = 2019-09-05
Ubuntu 18.04.3 LTS on an i9 9900X CPU. The system boots fine, but after a warm reboot it gets stuck.

However, when I use
(home) ~$ apt list --installed | grep micro
it shows:

amd64-microcode/bionic-updates,bionic-security,now 3.20191021.1+really3.20181128.1~ubuntu0.18.04.1 amd64 [installed,automatic]
intel-microcode/bionic-updates,bionic-security,now 3.20191115.1ubuntu0.18.04.1 amd64 [installed,automatic]
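
Editor's note: the dmesg line quoted in this comment is the most direct evidence of which revision is actually running, independent of the installed package version. A minimal sketch that extracts the revision from such a line, assuming the exact message format shown above. The set of affected revisions reflects only what was reported in this thread for signature 0x50654, and the names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the early-load message format quoted in this thread.
EARLY_UPDATE = re.compile(
    r"microcode: microcode updated early to revision (0x[0-9a-fA-F]+), "
    r"date = (\d{4}-\d{2}-\d{2})"
)

# Assumption: only revision 0x2000065 is treated as affected, per this thread.
REPORTED_AFFECTED_50654 = {0x2000065}


def early_revision(line: str) -> Optional[Tuple[int, str]]:
    """Return (revision, date) from an early-update dmesg line, or None."""
    match = EARLY_UPDATE.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)


line = "microcode: microcode updated early to revision 0x2000065, date = 2019-09-05"
revision, date = early_revision(line)
print(hex(revision), date, revision in REPORTED_AFFECTED_50654)
```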

@stevebeattie stevebeattie commented Dec 2, 2019

@xwjabc yes, the intel-microcode 3.20191115.1ubuntu0.18.04.1 (and other 3.20191115.1 packages for other Ubuntu releases) includes:

sig 0x00050654, pf_mask 0xb7, 2019-09-05, rev 0x2000065, size 34816

which is the latest microcode from this repository for your processor class.
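
Editor's note: the "sig / pf_mask / date / rev / size" line above can be reproduced from the 48-byte header at the start of each microcode update. A minimal sketch, assuming the header layout documented in the Intel SDM (nine little-endian 32-bit fields followed by 12 reserved bytes, with the date stored as BCD mmddyyyy). The function names are hypothetical and this is not how iucode_tool itself is implemented.

```python
import struct
from typing import Dict

HEADER_LEN = 48  # nine 32-bit fields plus 12 reserved bytes


def parse_ucode_header(blob: bytes) -> Dict[str, object]:
    """Decode the fixed header of a single Intel microcode update."""
    if len(blob) < HEADER_LEN:
        raise ValueError("buffer is shorter than a microcode header")
    (header_ver, revision, date, signature, checksum,
     loader_rev, pf_mask, data_size, total_size) = struct.unpack_from("<9I", blob, 0)
    # A data size of zero means the legacy fixed layout: 2000 data bytes, 2048 total.
    if data_size == 0:
        data_size, total_size = 2000, 2048
    # The date is BCD, so printing the bytes as hex yields the decimal digits.
    month, day, year = (date >> 24) & 0xFF, (date >> 16) & 0xFF, date & 0xFFFF
    return {
        "signature": signature,
        "pf_mask": pf_mask,
        "date": f"{year:04x}-{month:02x}-{day:02x}",
        "revision": revision,
        "total_size": total_size,
    }


def describe(header: Dict[str, object]) -> str:
    """Format a parsed header like the listing line quoted above."""
    return (
        f"sig 0x{header['signature']:08x}, pf_mask 0x{header['pf_mask']:02x}, "
        f"{header['date']}, rev 0x{header['revision']:04x}, size {header['total_size']}"
    )
```

Feeding it the first 48 bytes of intel-ucode/06-55-04 from a given release should show which revision that release ships for this processor class.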

@stevebeattie stevebeattie commented Dec 2, 2019

@whpenner we now have reports from Ubuntu users getting hit by this (e.g. https://bugs.launchpad.net/ubuntu/+source/intel-microcode/+bug/1854764 ); the way that early load microcode is written to the initramfs in Debian and Ubuntu is with iucode-tool --write-earlyfw, and thus should be 16-byte aligned, so it does appear that this is a problem with the microcode itself.

@whpenner whpenner commented Dec 13, 2019

Intel has received reports of reboot failures on certain Skylake based Intel® Xeon® W and Intel® Core™ X-series single socket platforms following the OS load of processor microcode revision 0x65. We have received no reports and have no evidence that these failures affect Skylake based Intel® Xeon® Scalable Performance multi-socket platforms. We are debugging the issue to establish root cause. In the interim, processor microcode revision 0x64 remains available for use.

@esyr-rh esyr-rh commented Dec 13, 2019

Here is revision 0x2000064 of the 06-55-04 microcode, for reference: https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/blob/microcode-20190918/intel-ucode/06-55-04

@hmh hmh commented Dec 13, 2019

@whpenner, thanks for the official guidance.

@hmh hmh commented Dec 25, 2019

There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.

That leaves us in a very nasty position: either tell users to deal with it and never reboot, or go back to revision 0x200005e, which has the JCC erratum and a lot of other nasty issues as far as I know (please correct me if I am wrong about this).

Can Intel give us a tentative timeframe for a fix? Or guidance on the least dangerous workarounds available?

@eworm-de eworm-de commented Dec 25, 2019

No complaints from Arch users...

Possibly those reporting the issue installed a fixed package, then did a warm reboot but had revision 0x65 loaded into their processors already?

@hmh hmh commented Dec 26, 2019

@ahesford ahesford commented Dec 26, 2019

> There are reports, one on Debian and another on Ubuntu, that revision 0x2000064 of the 0x50654 microcode DOES have the hang-on-reboot issue.

I definitely do not see the warm-reboot lockup with version 0x2000064 on the Xeon W-2145. Only the newer version causes the issue.

@hmh hmh commented Dec 26, 2019

Time to compare the chip names and/or process flags :-(

vt-alt pushed a commit to vt-alt/specs that referenced this issue Jan 20, 2020
- Sync with Debian 3.20191115.2:
  + New upstream microcode datafile 20191115
  + Microcode rollbacks (closes: debian #946515, LP#1854764):
    sig 0x00050654, pf_mask 0xb7, 2019-07-31, rev 0x2000064, size 33792
  + Avoids hangs on warm reboots (cold boots work fine) on HEDT and
    Xeon processors with signature 0x50654.
    intel/Intel-Linux-Processor-Microcode-Data-Files#21
  + Updated Microcodes:
    sig 0x000406e3, pf_mask 0xc0, 2019-10-03, rev 0x00d6, size 101376
    sig 0x000506e3, pf_mask 0x36, 2019-10-03, rev 0x00d6, size 101376
    sig 0x000806e9, pf_mask 0x10, 2019-10-15, rev 0x00ca, size 100352
    sig 0x000806e9, pf_mask 0xc0, 2019-09-26, rev 0x00ca, size 100352
    sig 0x000806ea, pf_mask 0xc0, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000806eb, pf_mask 0xd0, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000806ec, pf_mask 0x94, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000906e9, pf_mask 0x2a, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000906ea, pf_mask 0x22, 2019-10-03, rev 0x00ca, size 99328
    sig 0x000906eb, pf_mask 0x02, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000906ec, pf_mask 0x22, 2019-10-03, rev 0x00ca, size 99328
    sig 0x000906ed, pf_mask 0x22, 2019-10-03, rev 0x00ca, size 100352
    sig 0x000a0660, pf_mask 0x80, 2019-10-03, rev 0x00ca, size 91136