Z3735G update to 0x838 microcode causes tablet to hang at boot #55

Open
jwrdegoede opened this issue May 9, 2021 · 14 comments

@jwrdegoede

jwrdegoede commented May 9, 2021

While running Linux on a Glavey TM800A550L tablet I noticed that it hangs at boot, sometimes showing various color patterns on the display, suggesting that the processor is writing over random memory, including the framebuffer.

It took me a while to figure this out, but adding dis_ucode_ldr on the kernel commandline fixes this. This is quite a severe bug: at one point, when I forgot to add dis_ucode_ldr on the kernel commandline, the CPU overwrote parts of the memory-mapped SPI flash which contains the EFI nvram variables, including the Setup EFI variable which contains the BIOS settings. After this the tablet would no longer boot at all. I eventually managed to unbrick it without needing an eeprom programmer by using DNX mode, which still worked; see the blogpost which I wrote on this.

The BIOS on the troublesome tablet comes with ucode version 0x830, and the attempted update (which breaks things) tries to update it to 0x838. Perhaps there is a problem with the specific jump from 0x830 to 0x838?

Note I have an Acer S1002 tablet which also has a Z3735G processor and there the microcode update works fine.

@esyr-rh
Contributor

esyr-rh commented May 10, 2021

CPUID is 0x30678 (FF-MM-SS 06-37-08), I presume?
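
For readers unfamiliar with the FF-MM-SS notation, here is a minimal sketch of how it can be derived from the raw signature, assuming the standard CPUID leaf 1 EAX bit layout. The function name `decode_signature` is made up for illustration and is not part of any tool mentioned in this thread.

```python
def decode_signature(sig: int) -> str:
    """Format a CPUID(1).EAX signature as family-model-stepping (FF-MM-SS)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended model bits only apply to family 0x6 and 0xF.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family is only added for family 0xF.
    if family == 0xF:
        family += ext_family

    return f"{family:02x}-{model:02x}-{stepping:02x}"


print(decode_signature(0x30678))  # prints 06-37-08
```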

@dslul

dslul commented May 11, 2021

@jwrdegoede I couldn't post to your blog, but I got the same issue exactly a week ago on my HP x2 10 (Atom Z8350) with Fedora. After the usual reboot for updates the PC was completely dead, with just the caps lock LED turning on for a second when connecting the keyboard dock. I managed to recover it by formatting a USB stick with the official BIOS tool, and with the ctrl V combination at boot it started the BIOS recovery procedure.

@jwrdegoede
Author

> CPUID is 0x30678 (FF-MM-SS 06-37-08), I presume?

Correct. For some extra info, I've tried to figure out which microcode version the BIOS installs on the Acer S1002, where the microcode update does not cause issues. If I boot that device with "dis_ucode_ldr" on the kernel commandline and then look up the microcode version in /proc/cpuinfo, I get 0x82b. When I don't specify "dis_ucode_ldr" on the (working) S1002, I get the following kernel log messages related to microcode:

[    0.000000] microcode: microcode updated early to revision 0x838, date = 2019-04-22
[    5.314952] microcode: sig=0x30678, pf=0x2, revision=0x838
[    5.315523] microcode: Microcode Update Driver: v2.2.

So the 2 cases which I have are:

  1. BIOS Microcode 0x82b -> 0x838 works fine (Acer S1002)
  2. BIOS Microcode 0x830 -> 0x838 device hangs, overwriting memory while hanging (Glavey TM800A550L tablet)

Note I'm not claiming that the difference in the BIOS installed microcode is the reason things are failing on the Glavey tablet, but it is a possible cause for this.
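
To make the /proc/cpuinfo lookup repeatable across devices, something like the following could be used. This is a rough sketch, assuming the usual `microcode : 0x...` field that x86 kernels print for each CPU; the helper name `microcode_revisions` and the sample text are invented for illustration.

```python
import re


def microcode_revisions(cpuinfo_text: str) -> set:
    """Return the distinct microcode revisions found in /proc/cpuinfo-style text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in matches}


# Illustrative sample only, not a capture from any of the tablets above.
SAMPLE = "processor\t: 0\nmicrocode\t: 0x838\nprocessor\t: 1\nmicrocode\t: 0x838\n"

print(sorted(hex(rev) for rev in microcode_revisions(SAMPLE)))
```

On a real system the text would come from reading `/proc/cpuinfo`; more than one entry in the result would mean the CPUs disagree on the loaded revision.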

@jwrdegoede
Author

jwrdegoede commented May 11, 2021

> @jwrdegoede I couldn't post to your blog, but I got the same issue exactly a week ago on my HP x2 10 (Atom Z8350) with Fedora.

Interesting, is this microcode related too, or is the similarity just that you also got corrupt BIOS settings somehow?

@dslul

dslul commented May 11, 2021

> Interesting, is this microcode related too, or is the similarity just that you also got corrupt BIOS settings somehow?

I'm not sure if it was microcode related, but I find it curious that it happened at the same time, for the same family of processors. It's not a common fault after all.

@jwrdegoede
Author

> I'm not sure if it was microcode related, but I find it curious that it happened at the same time, for the same family of processors. It's not a common fault after all.

Cherry Trail and Bay Trail are related but use a (somewhat) different generation of CPU cores. Also, the last microcode update for these devices was quite a while ago; I hit this now only because I tried Linux on the Glavey tablet recently. I doubt that your issue is related to the ucode issue which I'm seeing. What the two do have in common is that Bay Trail and Cherry Trail devices are both susceptible to having their BIOS settings corrupted relatively easily. If you want to discuss this further, please drop me an email at hdegoede@redhat.com, so that we can keep this github discussion focussed on the Bay Trail ucode issue.

@jwrdegoede
Author

Ugh, I hit this again while I was testing Linux on a third tablet with a Z3735G CPU. At first things were fine, but then I updated the kernel from 5.12.0 to 5.13.0-rc1 and it stopped booting. Adding "dis_ucode_ldr" made it boot again. 5.13 has only one new commit under arch/x86/kernel/cpu/microcode:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7189b3c11903667808029ec9766a6e96de5012a5

I tried reverting this but it does not help.

FWIW this third tablet's BIOS updates the microcode to 0x832 before booting the OS.

@jwrdegoede
Author

I just hit this again, on the same device. After re-installing Linux it took me a while to remember this issue. Is there anything we can do about this?

@jwrdegoede
Author

One more note, on a hunch I tried updating the microcode after Linux booted by doing:

echo 1 > /sys/devices/system/cpu/microcode/reload

And this works fine, so I suspect there is some bad interaction here between the early microcode loader and the BIOS on the TM800A550L tablet. Maybe the early microcode loader has certain expectations about the pre-boot memory layout or some such?

Any suggestions on how I can debug the early microcode loading code? Is there some way to make it log messages using e.g. EFI calls to show the messages on the EFI console?
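
For anyone who wants to repeat the late-reload experiment and record the revision before and after, here is a small sketch. It assumes the per-CPU `microcode/version` sysfs file next to the `reload` file used above, needs root, and only works on kernels that allow late loading. The function names are hypothetical.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def read_revision(cpu_root: Path = CPU_SYSFS, cpu: int = 0) -> int:
    """Read the microcode revision currently loaded on one CPU."""
    text = (cpu_root / f"cpu{cpu}" / "microcode" / "version").read_text()
    return int(text.strip(), 16)


def late_reload(cpu_root: Path = CPU_SYSFS) -> tuple:
    """Trigger a late microcode reload; return (revision before, revision after)."""
    before = read_revision(cpu_root)
    # Same effect as: echo 1 > /sys/devices/system/cpu/microcode/reload
    (cpu_root / "microcode" / "reload").write_text("1\n")
    after = read_revision(cpu_root)
    return before, after


if __name__ == "__main__":
    old, new = late_reload()
    print(f"microcode revision: {old:#x} -> {new:#x}")
```

The `cpu_root` parameter exists only so the functions can be pointed at a fake directory tree for testing without touching the real sysfs.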

@hmh

hmh commented Jan 5, 2022

It runs too early. Last time I needed to debug the early update microcode driver, I had to add a static buffer to fill with debug data, and print it later. Maybe early printk on a system with an early console active can do it as well.

That said, do try a very recent upstream kernel, there were quite a few changes on that driver from what I recall from LKML mails, it might have increased compatibility with weird firmware, especially if it is EFI...

@jwrdegoede
Author

> That said, do try a very recent upstream kernel, there were quite a few changes on that driver from what I recall from LKML mails, it might have increased compatibility with weird firmware, especially if it is EFI...

Last time I hit this I tried 5.16-rc5, I guess I could try 5.17-rc1 once it is out if you think that might help?

@hmh

hmh commented Jan 5, 2022

I don't have any reason to believe 5.17-rc would be any better than 5.16-rc5, unfortunately...

@jwrdegoede
Author

Quick update on this: I just hit this on a third tablet, an Acer Iconia One 7 B1-750.

Both the Glavey TM800A550L and the Acer Iconia One 7 B1-750 are tablets which ship with Android 4.4 x86 as factory OS. Although they use more or less standard UEFI firmware, at least the ACPI tables are very funky, e.g. filled with I2C devices which are not actually there, since the Android vendor kernel has everything hardcoded anyway. I guess the ucode update problem is related to the fw in these tablets being funky in other ways too. Any ideas?

Unfortunately I did not write down here what the earlier third Bay Trail device on which I hit this was, but I'm pretty sure it was a device with Android as factory OS too (as a hobby project I'm working on making these devices work with standard Linux).

@hmh

hmh commented Mar 23, 2023

> And this works fine, so I suspect there is some bad interaction here between the early microcode loader and the BIOS on the TM800A550L tablet. Maybe the early microcode loader has certain expectations about the pre-boot memory layout or some such?
>
> Any suggestions on how I can debug the early microcode loading code? Is there some way to make it log messages using e.g. EFI calls to show the messages on the EFI console?

Try the microcode driver in the newest kernel, lots of fixes went in... who knows.

Also, this is a very long shot, but the Linux early microcode loader fails to ensure 16-byte alignment on the microcode patch [when loading from early-initramfs] -- it has been like that since day one of the early loading support, and I don't think this has been -- or will be -- fixed.

The Intel manual used to (and maybe still does) require such 16-byte alignment, but apparently almost every non-ancient Intel x86-64 processor only cares about 4-byte alignment, at least almost all of the time. Only Intel knows what happens if the start or end of the hot area of the microcode data crosses a page boundary due to misalignment, etc. It would not be the first CPU operation that is highly allergic to this kind of border condition.

We have long worked around this driver shortcoming in userspace when using iucode-tool to generate the initramfs, so you could try to use it to generate your early-initramfs.

Note that this is indeed a long shot: I have never seen any microcode loading issue be solved by forcing this alignment. Also, I do not know if there are alignment issues when you have the microcode early upload data hardcoded into the kernel in some other way the firmware loader supports.
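
As a rough illustration of the alignment arithmetic involved, the sketch below computes how many padding bytes would be needed before a microcode blob so that it starts on a 16-byte boundary. The offset used is a made-up value, and this is not a description of how iucode-tool actually lays out the early initramfs.

```python
def is_aligned(offset: int, alignment: int = 16) -> bool:
    """True if a byte offset falls on the given alignment boundary."""
    return offset % alignment == 0


def padding_to_align(offset: int, alignment: int = 16) -> int:
    """Number of bytes to skip so that `offset` reaches the next boundary."""
    return -offset % alignment


# Hypothetical offset of the microcode data inside an archive.
offset = 148
print(is_aligned(offset, 4))     # True: fine for 4-byte alignment
print(is_aligned(offset, 16))    # False: not on a 16-byte boundary
print(padding_to_align(offset))  # 12
```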
