2 x APU2E4 unstable with CPB enabled. #251

Open
MyGithubUser01 opened this issue May 9, 2021 · 17 comments

@MyGithubUser01

Hi all,

I'm having serious stability issues with the APU2E4 when CPB is enabled, on BIOS 4.13.0.1 and 4.13.0.5.
This is brand-new hardware. The first unit was believed to be "DOA", but the replacement I got has exactly the same issue.
After disabling CPB the system appears to be stable, with a record-high uptime of 4 days and counting.

Operating systems tested: OPNsense 21.1 and 21.1.5.
I've tried booting from mSATA, SD card and USB, but I get the same issue each time.
I've also tried multiple power adapters.
The CPU temperature is typically in the range 54-56 °C, and the system isn't even connected to a network, just the console cable.

The system has been very unstable and core dumps every 4-12 hours on BIOS 4.13.0.1. I saw similar issues when testing 4.13.0.5.

From the logs/console I see the following:

FreeBSD/amd64 (OPNsense.localdomain) (ttyu0)
login: MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x282060
[HBSD SEGVGUARD] [/usr/local/bin/python3 (5880)] Suspension expired.
-> pid: 5880 ppid: 1302 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>

And:
"root@OPNsense:/var/db/rrd # MCA: Bank 1, Status 0xd400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR OVER ICACHE L1 IRD error
MCA: Address 0xffff80d1ff60"

Let me know if additional details are required.
Is this broken hardware, a BIOS bug, or an OPNsense/HardenedBSD compatibility issue?

@miczyg1 miczyg1 self-assigned this May 9, 2021
@miczyg1
Member

miczyg1 commented May 9, 2021

CPB additionally enables the core/package C6 states. I recently discovered some bugs in coreboot around C6 and its save state area in DRAM, which may be causing problems when CPB is enabled. The patches have already been sent to coreboot, so I can test whether they resolve your issue. If I understood correctly, there is no need to stress the firewall device to trigger this?

@MyGithubUser01
Author

Thank you very much for the feedback. That is correct: there is no need to run anything on the firewall.
The most stressful thing I've been running is "top".

I don't even have network cables attached to the firewall.

@miczyg1
Member

miczyg1 commented Jun 24, 2021

@MyGithubUser01 I left the OPNsense 21.1 installer running on an apu2 overnight (20 hours of idling) with CPB enabled. There was not a single MCA error on the serial console with apu2 v4.14.0.1, which contains the fixes I mentioned in the previous comment. Could you please give v4.14.0.1 a try? Let me know if it helps in your case.

@MyGithubUser01
Author

Hi,

Thank you very much for the update. I've now updated to 4.14.0.1 and made sure CPB is enabled (it looks like it was enabled after flashing). I started the firewall about 24 hours ago with the serial console and WAN connected, but this morning I had only 5 hours of uptime and found the output below in the console/log. This means it happened after 16-20 hours.

It looks like it doesn't happen as often as before, but I'm not sure.

WARNING: attempt to domain_add(netgraph) after domainfinalize()
pid 26616 (python3.7), jid 0, uid 0: exited on signal 11 (core dumped)
[HBSD SEGVGUARD] [/usr/local/bin/python3 (88561)] Suspension expired.
-> pid: 88561 ppid: 82917 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0xffff811f7c40
pid 95751 (python3.7), jid 0, uid 0: exited on signal 10 (core dumped)
[HBSD SEGVGUARD] [/usr/local/bin/python3 (74746)] Suspension expired.
-> pid: 74746 ppid: 65084 p_pax: 0xa50<SEGVGUARD,ASLR,NOSHLIBRANDOM,NODISALLOWMAP32BIT>
MCA: Bank 1, Status 0x9400000000000151
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
MCA: CPU 0 COR ICACHE L1 IRD error
MCA: Address 0x63b4b76aae0

@miczyg1
Member

miczyg1 commented Jul 2, 2021

Somehow I cannot reproduce it, and we have never run into MCA errors before.

The warning WARNING: attempt to domain_add(netgraph) after domainfinalize() looks suspicious. I have found a similar issue here: https://forum.opnsense.org/index.php?topic=17417.0
Maybe following this thread could help you a bit?

@v1k4

v1k4 commented Jul 5, 2021

Brand new APU4D4 here. It crashes multiple times a day when CPB is enabled. FW versions v4.14.0.1 and v4.14.0.2.

I'm currently running Proxmox on Buster, and I get many errors of this kind before the APU eventually hangs or crashes.

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:1 (16:30:1) MC1_STATUS[-|CE|-|AddrV|-|-|-]: 0x9400000000000151
[Hardware Error]: Error Addr: 0x0000ffff86b0b9e0
[Hardware Error]: MC1 Error: Data/tag array parity error for a tag hit.
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Disabling CPB seems to make it stable for now.

@MyGithubUser01
Author

MyGithubUser01 commented Jul 7, 2021

This sounds very similar to what I'm experiencing with CPB enabled, thanks for chiming in.
The same address is mentioned: 0x9400000000000151

Are these all related?
https://forum.netgate.com/topic/156761/page-fault-while-in-kernel-mode-on-apu2-after-bios-coreboot-upgrade/4
https://forum.netgate.com/topic/156830/could-you-help-me-analyze-these-crashdumps/5

@miczyg1
Member

miczyg1 commented Jul 8, 2021

0x9400000000000151 is not the address but the actual content of the 64-bit MC1_STATUS register. It simply means that the same error was triggered; the address may still be different. The address where it was triggered is shown in the line [Hardware Error]: Error Addr: 0x0000ffff86b0b9e0, and the error code is decoded in the lines that follow it.
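
For anyone who wants to check the decoding by hand, here is a minimal Python sketch. It assumes the usual x86 MCA status layout (flag bits 63 down to 57, error code in the low 16 bits, memory hierarchy codes of the form 0000 0001 RRRR TTLL), which is consistent with the flags and decoded lines Linux and FreeBSD printed above. The function and table names are made up for illustration and do not come from any driver.

```python
# Flag bit positions in MCi_STATUS (assumed layout, see the note above).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # overflow: another error occurred before this one was cleared
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register is valid
    "ADDRV": 58,  # the ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Field encodings for memory hierarchy error codes: 0000 0001 RRRR TTLL
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["L0", "L1", "L2", "LG"]


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit MCA status value into flags and error-code fields."""
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # Only decode the sub-fields when the code has the memory hierarchy form.
    if (code & 0xFF00) == 0x0100:
        r = (code >> 4) & 0xF
        result["mem_tx"] = RRRR[r] if r < len(RRRR) else "RESV"
        result["tx"] = TT[(code >> 2) & 0x3]
        result["level"] = LL[code & 0x3]
    return result


if __name__ == "__main__":
    for value in (0x9400000000000151, 0xD400000000000151):
        print(hex(value), decode_mc_status(value))
```

With these assumptions, both status values reported in this thread decode to an L1 instruction-read (IRD) error with a valid address and the UC flag clear, i.e. a corrected error. They differ only in the OVER bit, which matches the "COR OVER" wording in the second FreeBSD log.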

Still, what is written in the forum is not exactly true. CPU Boost does not raise the memory clock frequency. It cannot, because that would require retraining the memory at the new frequency (only the BIOS can train the memory, and once that is done the memory frequency is fixed).

CPB is not an overclocking feature! It simply raises the CPU clock frequency to the limits allowed by the CPU specification. Overclocking would be to go higher than what CPB provides (i.e. higher than 1400MHz).

@toredash

toredash commented Sep 26, 2021

Wanted to chime in on this. I have had the same stability issues described here with my APU2E4 for many years. It would work for weeks at a time, then have stability issues daily over a period of several weeks, then go back to a few weeks between failures.

I have disabled CPB now, and so far it looks good. But it has only been a week, so I will have to wait a few months to really be sure.

I've found others who point to the same thing, that CPB causes issues:

https://forum.netgate.com/topic/156761/page-fault-while-in-kernel-mode-on-apu2-after-bios-coreboot-upgrade/38
https://www.reddit.com/r/homelab/comments/lokgyg/solution_to_pc_engines_apu2e4_having_constant/

edit: 37d uptime and no issues encountered after CPB was disabled

@damiankaras damiankaras reopened this Oct 18, 2021
@damiankaras damiankaras transferred this issue from pcengines/coreboot Oct 18, 2021
@toredash

After a power outage, my APU2E4 started to misbehave again, randomly locking up.

I checked whether CPB had for some reason been re-enabled, and sure enough, it was enabled again.

I'll report back if it is still stable.

@mkopec
Member

mkopec commented May 17, 2022

Hi @toredash , did you experience any more lock-ups since disabling CPB?

@ghost

ghost commented May 20, 2022

Apologies for a "me too" comment but unfortunately for me, the described symptoms in this issue are also somewhat occurring with my APU2E4. Alas, I'm long past the warranty period as I've purchased mine in the summer of 2020. I have not tested with Linux since I am mainly running pfSense on this unit. This occurs with 2.6.0 and a few previous versions.

I've only experienced it randomly crashing and restarting a few times, but since this unit is operating as a firewall for my home connection, I'd rather turn off CPB to make the unit stable again than deal with the instability and MCA errors. If at all possible, I would gladly appreciate a fix to this issue as I could use the extra performance to handle bursty traffic flows since I have a gigabit internet connection. I am willing to turn CPB on again and offer my help in debugging the problem.

As mentioned above, I also saw MCA errors with CPB enabled, but unfortunately I no longer have the logs because I have since done a fresh install (I stumbled across this issue by chance while reading the documentation for something unrelated). They appeared very similar, and the error seemed to be associated with CPU2 in my case. With CPB off, I never see them in the syslog.

I do, however, still see errors in the pfSense logs that seem related to the firmware and the first i210 NIC, and I don't know whether they are related.

May 19 22:43:04 	kernel 		igb2: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04 	kernel 		igb2: Ethernet address: <REDACTED>
May 19 22:43:04 	kernel 		igb2: Using MSI-X interrupts with 5 vectors
May 19 22:43:04 	kernel 		igb2: Using 4 RX queues 4 TX queues
May 19 22:43:04 	kernel 		igb2: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04 	kernel 		igb2: NVM V0.6 imgtype5
May 19 22:43:04 	kernel 		igb2: <Intel(R) I210 Flashless (Copper)> port 0x3000-0x301f mem 0xd0200000-0xd021ffff,0xd0220000-0xd0223fff irq 36 at device 0.0 on pci3
May 19 22:43:04 	kernel 		pci3: <ACPI PCI bus> on pcib3
May 19 22:43:04 	kernel 		pcib3: <ACPI PCI-PCI bridge> irq 27 at device 2.4 on pci0
May 19 22:43:04 	kernel 		igb1: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04 	kernel 		igb1: Ethernet address: <REDACTED>
May 19 22:43:04 	kernel 		igb1: Using MSI-X interrupts with 5 vectors
May 19 22:43:04 	kernel 		igb1: Using 4 RX queues 4 TX queues
May 19 22:43:04 	kernel 		igb1: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04 	kernel 		igb1: NVM V0.6 imgtype5
May 19 22:43:04 	kernel 		igb1: <Intel(R) I210 Flashless (Copper)> port 0x2000-0x201f mem 0xd0100000-0xd011ffff,0xd0120000-0xd0123fff irq 32 at device 0.0 on pci2
May 19 22:43:04 	kernel 		pci2: <ACPI PCI bus> on pcib2
May 19 22:43:04 	kernel 		pcib2: <ACPI PCI-PCI bridge> irq 26 at device 2.3 on pci0
May 19 22:43:04 	kernel 		igb0: netmap queues/slots: TX 4/1024, RX 4/1024
May 19 22:43:04 	kernel 		igb0: Ethernet address: <REDACTED>
May 19 22:43:04 	kernel 		igb0: Using MSI-X interrupts with 5 vectors
May 19 22:43:04 	kernel 		igb0: Using 4 RX queues 4 TX queues
May 19 22:43:04 	kernel 		igb0: Using 1024 TX descriptors and 1024 RX descriptors
May 19 22:43:04 	kernel 		igb0: NVM V0.6 imgtype5
May 19 22:43:04 	kernel 		igb0: <Intel(R) I210 Flashless (Copper)> mem 0xd0000000-0xd001ffff,0xd0020000-0xd0023fff irq 28 at device 0.0 on pci1
May 19 22:43:04 	kernel 		pci1: <ACPI PCI bus> on pcib1
May 19 22:43:04 	kernel 		pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff
May 19 22:43:04 	kernel 		pcib1: <ACPI PCI-PCI bridge> irq 25 at device 2.2 on pci0
May 19 22:43:04 	kernel 		pci0: <base peripheral, IOMMU> at device 0.2 (no driver attached)
May 19 22:43:04 	kernel 		pci0: <ACPI PCI bus> on pcib0
May 19 22:43:04 	kernel 		pcib0: could not evaluate _ADR - AE_NOT_FOUND
May 19 22:43:04 	kernel 		pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0 

The relevant lines/errors:

May 19 22:43:04 kernel pcib1: failed to allocate initial I/O port window: 0x1000-0x1fff

and

May 19 22:43:04 kernel pcib0: could not evaluate _ADR - AE_NOT_FOUND

Again, I don't know if the above lines/errors are relevant to the instability issue with CPB at hand.

@toredash

> Hi @toredash , did you experience any more lock-ups since disabling CPB?

No, my device has been stable since CPB was disabled.

@daduke

daduke commented Jun 8, 2022

FTR I've also had bad stability issues with an apu6 that seem to be solved by disabling CPB. Maybe CPB shouldn't be enabled by default?

@toredash

toredash commented Sep 4, 2022

Update: My device is still stable after several months.

@toredash

toredash commented Nov 1, 2022 via email

@mdickers47

FWIW, I have a very old APU2 that became extremely unstable with CPB and Linux 6.2.5. It generates a lot of different "null pointer dereference," "unable to handle page fault," and "soft lockup" panics. It lasts no more than a few hours per reboot, and sometimes only a few minutes. Disabling CPB in the BIOS seems to have solved it.

Here is one of the common panics:

[ 4238.591613] BUG: kernel NULL pointer dereference, address: 0000000000000003
[ 4238.598622] #PF: supervisor write access in kernel mode
[ 4238.603856] #PF: error_code(0x0002) - not-present page
[ 4238.609046] PGD 106842067 P4D 106842067 PUD 103a51067 PMD 0 
[ 4238.614744] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4238.619132] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.2.5-arch1-1 #1 fcf70e9d97e045884ea945a3d5b5ff73b06f7a27
[ 4238.629245] Hardware name: PC Engines apu2/apu2, BIOS v4.19.0.1 01/31/2023
[ 4238.636132] RIP: 0010:psi_group_change+0x2f/0x400
[ 4238.640906] Code: 41 57 48 63 c6 49 89 ff 41 56 41 55 41 54 41 89 cc 55 53 48 83 ec 20 48 8b 5f 30 48 03 1c c5 c0 da eb ab 4c 89 04 24 83 03 01 <44> 89 4c 24 10 48 89 44 24 08 f6 c2 10 0f 85 ea 02 00 00 f6 c1 10
[ 4238.659677] RSP: 0018:ffff9b4e800d3dc0 EFLAGS: 00010002
[ 4238.664929] RAX: 0000000000000003 RBX: ffffbb4e7fd81dc0 RCX: 0000000000000010
[ 4238.672104] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8c0a40d61800
[ 4238.679258] RBP: 0000000000000003 R08: 000003dadfbf0c5d R09: 0000000000000001
[ 4238.686442] R10: 0000000000000001 R11: 0000000000000100 R12: 0000000000000010
[ 4238.693593] R13: 0000000000000003 R14: ffff8c0a46408000 R15: ffff8c0a40d61800
[ 4238.700744] FS:  0000000000000000(0000) GS:ffff8c0a6ad80000(0000) knlGS:0000000000000000
[ 4238.708852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4238.714618] CR2: 0000000000000003 CR3: 0000000106eb0000 CR4: 00000000000406e0
