
hard crash on Intel(R) Xeon(R) CPU E5-2690 (06-2d-07) #15

Closed
bulhoes opened this issue Oct 4, 2019 · 30 comments

@bulhoes

bulhoes commented Oct 4, 2019

Hello,

I have some Oracle X3-2 servers (formerly SUN FIRE X4270 M3) running the E5-2690 CPU, and after upgrading to the latest microcode, released on June 19, I have been experiencing crashes on those servers.

I have servers running flavors of CentOS 6 and CentOS 7, with different kernels, and they all suffer from the same issue.

On some of the servers I've even updated the BIOS to the latest release, which embeds the Intel microcode, and they still have the same issue.

The issue is "reproducible" in the sense that, with enough load, the servers will eventually crash and hard reset without any logs in the OS.

The microcode revision reported by the kernel is 0x718.

The servers in question have an ILOM subsystem, and it reports hardware issues: sometimes it marks the CPU as failed, other times a faulty component that no automated diagnosis can identify. I do not have Oracle support for these servers.

I even started by acquiring "new" CPUs for the servers, suspecting overheating, since they were at the bottom of a rack. That didn't help with the issue at all.

When the servers started to fail, we upgraded the BIOS on some of them, which also enforces the 0x718 microcode revision.

On the other hand, I have some other similar servers; I patched them last Friday, and straight away one of them started to hard reset.
The other one took longer, but it suffers from the same issue, too.

The OS on them is different, and so is the kernel.

Some of them, the ones with the BIOS updated, are running a flavour of Red Hat 6; the others, with only the OS updated, are running CentOS 7 with a mainline kernel from ELRepo.
So the OS stack is different, and the kernels are very different across the servers.
Even so, the behaviour is very similar, and the "crashes" occur across the board.

Because I didn't update the firmware on the CentOS 7 servers, I was able to "downgrade" the microcode, and I'm now running microcode 0x714 on them.

They don't have enough uptime for me to say it "fixes" the issue, but I will keep this post updated with more information asap.

Has anyone been facing similar issues with this CPU and microcode?

I've been looking around and trying different solutions, but given my samples, this points to the microcode deployment.

The second set of servers, the ones I was able to downgrade, run storage, and they had been stable for a long time (more than a year) before this update.

With the right load, I usually can get a crash on a server in about 3 hours.
That is not a normal workload, but a simulated high workload to force the issue to occur.

Any ideas would be really appreciated; I've been looking into this for some time now and I'm out of leads to follow.

Thank you.

Kind regards,
Jorge Bulhoes

@mcu-administrator mcu-administrator self-assigned this Oct 4, 2019
@mcu-administrator
Contributor

@bulhoes Thanks for reaching out to us. We are currently debugging an issue with Oracle that we believe is the same issue. We will provide an update as this progresses.

@bulhoes
Author

bulhoes commented Oct 4, 2019

Hello @mcu-administrator,

Considering that I think I can reproduce the issue in under a day, I'm available to test any solutions that you might deem important.

@mcu-administrator
Contributor

@bulhoes Thanks for the offer. We currently have a system in-house as well. When we have something for testing, we'll be in touch.

@jplindquist

I don't know if this is related to what we've been seeing, but after upgrading several of our CentOS 7 servers to microcode_ctl-2.1-47.5.el7_6.x86_64 or greater, we've been seeing intermittent shutdowns with nothing in the logs as well. We traced it back to something in the updated microcode_ctl package and rolled them all back to microcode_ctl-2.1-47.4 (0x714), where things seem to be stable.

Are you able to confirm if this is related, or should I treat this as a separate issue? Here's some additional info below regarding our systems if it helps:

CentOS Linux release 7.7.1908 (Core)
Linux 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019

vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz
stepping	: 7

@esyr-rh
Contributor

esyr-rh commented Oct 11, 2019

$ rpm -qp --provides microcode_ctl-2.1-47.5.el7_6.x86_64.rpm|grep 'cpuid:000206d7'
iucode_date(fname:intel/06-2d-07;cpuid:000206d7;pf_mask:0x6d) = 2019.05.21
iucode_rev(fname:intel/06-2d-07;cpuid:000206d7;pf_mask:0x6d) = 0x718

Microcode revision and CPUID match, so it may be related.
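To check which revision a running system has actually loaded, one can read it from /proc/cpuinfo (a hedged one-liner; the kernel prints one `microcode` line per logical CPU, so the first is enough):

```shell
# Print the microcode revision the kernel reports for the first logical CPU;
# affected systems show 0x718 here, downgraded ones 0x714 or older.
grep -m1 '^microcode' /proc/cpuinfo
```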

@bulhoes
Author

bulhoes commented Oct 14, 2019

Hello @jplindquist ,

The problem is indeed related to the microcode 0x718 for those cpus.

If you run the 0x714 from the package microcode_ctl-2.1-47.4, like you did, you shouldn't have any issues.

I have an uptime of over 10 days now, that I never had before the downgrade.

Jorge

@jplindquist

Thank you for the follow ups! We have downgraded on the servers with the affected CPUs, and have been stable since as well. I'll keep following here for any further updates or resolution 👍

@esyr-rh
Contributor

esyr-rh commented Oct 16, 2019

FYI, microcode_ctl-2.1-47.7.el7_6[1] (and microcode_ctl-2.1-53.2.el7_7[2]) include a workaround for this issue by providing a version of the 06-2d-07 microcode file containing revision 0x714 by default (revision 0x718 can still be enabled explicitly, though).

[1] https://access.redhat.com/errata/RHEA-2019:3091
[2] https://access.redhat.com/errata/RHEA-2019:3111
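For completeness, re-enabling 0x718 on those builds goes through microcode_ctl's caveat-override mechanism. A sketch under the assumption that the package uses its standard `ucode_with_caveats` override path (the exact file name is an assumption; verify it against the errata text before using this on a production box):

```shell
# Assumed caveat-override path for microcode_ctl; check the errata/README
# for the exact file name before relying on this.
touch /etc/microcode_ctl/ucode_with_caveats/force-intel-06-2d-07
dracut -f    # regenerate the initramfs so early loading picks up 0x718
reboot
```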

@hmh

hmh commented Oct 17, 2019

Do we have a confirmation that VMWERV is broken on SNB-EP 0x718 ?

If so, should distros downgrade SNB-EP microcode from 0x718 to 0x714 while we wait for a proper bugfixed release from Intel?

@bulhoes
Author

bulhoes commented Nov 6, 2019

Hello @mcu-administrator,

I'm sorry to push for an update, but I need to understand whether there has been any progress on this.
I have 3 servers completely parked because of it, and I need to know whether I'll be able to "revive" them or whether I need to procure replacements.

Thank you.

Kind regards,
Jorge

@hmh

hmh commented Nov 6, 2019

@bulhoes: 0x718 only "breaks" servers that host VMs. If you're running bare-metal or containers (such as docker), it seems to be working fine (note: I don't work for @intel, they might know more, but so far have provided no further guidance).

@bulhoes
Author

bulhoes commented Nov 6, 2019

@hmh
My experience is very different!
The servers I have stopped are in fact hypervisors.
They are parked because I updated the BIOS and now get the microcode from the BIOS, so only a BIOS fix will recover them. That is it.

On the other hand, I had a couple of servers running storage, and after updating microcode_ctl to 0x718 I started seeing the same behaviour. So, from my perspective, it's not only hypervisors; some other workloads are affected too. With the microcode package downgraded and a power cycle, the servers are stable.

Either way, this is to say that although it might be more visible under virtualization, it affects other workloads as well.

Hope it helps.

Kind regards,
Jorge

@hmh

hmh commented Nov 6, 2019

@bulhoes: thanks for the report. Indeed, it is different from what we observed at work, and a relevant data point.

@jplindquist

Our issues were very different as well. All of our affected servers were bare metal, not hosting VMs. It does seem specific to the workload: some kind of call, or series of calls, that triggers the system halt. We haven't been able to narrow it down beyond that, but we have been stable on the rolled-back version until there's a fix in place.

@hmh

hmh commented Nov 14, 2019

FWIW, our servers using the 0x718 microcode are running up-to-date Ubuntu 18 and up-to-date docker-ce. They're Dell PowerEdge R420 servers with BIOS 2.4.2 (will be updated soon). The kernel is Ubuntu's 4.15 kernel.

Workload is build farms (gcc and the like).

So far, they have never shown any issues, either while idle, or fully loaded.

@esyr-rh
Contributor

esyr-rh commented Nov 20, 2019

FWIW, our servers using the 0x718 microcode

May I ask, is the microcode update performed by the system firmware or by the Linux kernel?

@hmh

hmh commented Nov 20, 2019

FWIW, our servers using the 0x718 microcode

May I ask, is the microcode update performed by the system firmware or by the Linux kernel?

The update is being applied by the Linux kernel (early mode).
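Whether the kernel (rather than the firmware) applied the update can be confirmed from the boot log; a small check, assuming the usual `microcode:` driver message format:

```shell
# An early-mode update by the kernel leaves a line like
#   microcode: microcode updated early to revision 0x718, date = 2019-05-21
# in the boot log; a firmware-applied update does not.
dmesg | grep -i 'microcode updated early'
```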

@bulhoes
Author

bulhoes commented Nov 21, 2019

Hello @hmh and @esyr-rh

The microcode can come from both sides.

If you updated the BIOS of the server with a release published after the June 19 microcode drop, the microcode can be coming from the BIOS. If you didn't, it has to be coming from the microcode_ctl package, applied during the boot process.

You can downgrade microcode_ctl to a release from before June 19th (the date this microcode was released here on GitHub), make sure the initial ramdisks are rebuilt with the downgraded microcode, and power the server off and on. You can then look at /proc/cpuinfo and see the microcode level there.

If you get something older than the 0x718, you should be ok. The 0x714 is running fine on my servers.
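The steps above might look like the following on CentOS 7 (a sketch; the package version comes from this thread, and the exact dracut layout may differ on your system):

```shell
# Downgrade to the package carrying microcode revision 0x714 (version number
# taken from this thread), rebuild the initramfs, and do a full power cycle.
yum downgrade microcode_ctl-2.1-47.4    # ships the pre-June-19 0x714 file
dracut -f                               # rebuild initramfs for early loading
poweroff                                # power off and on, not just a reboot
# After power-on, verify:  grep -m1 '^microcode' /proc/cpuinfo
```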

If you updated the BIOS, you might look into downgrading it. That is a last-resort scenario, and I wouldn't recommend it lightly.

The servers whose BIOS I updated with the 0x718 microcode are Oracle X3-2 (formerly the SUN FIRE X4170 M3). On that front, I saw online that, although it wouldn't be the first choice, Oracle says you can downgrade to any release available for download on their site if it is really needed.

With this information, I downgraded the BIOS of the servers I had stopped, and they are working happily now. I did this only 1 or 2 days ago, but I thrashed them with load overnight and they are working without any issues.

A BIOS downgrade is always a risky operation, but in my case it paid off, and I have the servers operational again.

Jorge

@hmh

hmh commented Nov 22, 2019

@bulhoes we know :) It was a datapoint collection.

The servers at work are not showing any of the regressions, and they are being updated by the Operating System to microcode 0x718 -- i.e. they are running stable on 0x718 just fine. That's a data point. It might be that my workload does not trigger the regression, though.

@bulhoes
Author

bulhoes commented Nov 22, 2019

Hello @hmh, I would like to ask you to run a test for me, if and whenever you can.

I would like to know if you can leave one server running stress with -c and -m over a weekend.

For example, if the OS reports 32 cores, would you be able to leave this command running over a weekend:

stress -m 32 -c 32

This will certainly keep your CPU busy, and will give us a fairly accurate stability report of the server.
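For an unattended run, stress(1)'s timeout flag keeps it bounded; a sketch assuming the classic `stress` tool is installed, sizing the worker counts from nproc instead of hard-coding 32:

```shell
# Spawn one CPU spinner and one memory worker per reported core, and stop
# automatically after 48 hours (stress accepts s/m/h/d suffixes on --timeout).
N=$(nproc)
stress --cpu "$N" --vm "$N" --timeout 48h
```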

I know it completely trashes the server, but....

Also, there is something we are not comparing that might make a big difference:

What server/board are you using the cpus on?

As I stated before, I'm using old SUN FIRE X4270 M3 servers rebranded as Oracle X3-2, and these servers have a lot more hardware around to cause trouble. Interactions with the other parts of the hardware can cause the issue; as I noted, the ILOM was marking the CPU as bad and in need of replacement.

So, if you are using systems with less complexity, maybe that is why you are not observing the same behaviour.

Please let me know what hardware you are using.

Kind regards,
Jorge

@hmh

hmh commented Dec 13, 2019

@bulhoes: They are all in production right now, and so far I have failed to secure an extended maintenance window that would allow for such testing.

Do your affected servers reproduce the hang when you run "stress" on them? Otherwise, it is doubtful ours would...

Our servers with Xeon E5-2690 processors are a batch of Dell R420, which means they run on Dell motherboards. Dell's older iDRAC (I can't recall what generation of iDRAC comes with the R420 right now, but it is not anywhere as capable as the stuff we got in newer R440's) is likely a lot simpler than the BMC in a SUN FIRE.

@ranshalit

ranshalit commented Dec 25, 2019

Hello,

It seems that I am facing a similar problem, which we have been struggling with for a month now.
I would appreciate any feedback.
We have already been trying to reach assistance, but got none.

https://forums.intel.com/s/question/0D70P000006Y0NU/cant-access-ram-from-pcie-when-using-xeon?s1oid=00DU0000000YT3c&s1nid=0DB0P000000U1Hq&emkind=chatterCommentNotification&s1uid=0050P000008KoEs&emtm=1575043703545&fromEmail=1&s1ext=0

https://bugzilla.kernel.org/show_bug.cgi?id=205701

In short:

  1. We use a kernel module in several Intel computers, which allocates physical memory and provides the address to an FPGA. The FPGA tries to do a DMA transfer into the Xeon's RAM, but fails.
  2. Trying to upgrade the kernel from 4.x to the latest 5.4 fails/hangs/freezes during boot without any errors.

**Is it a bug in the Xeon chip, a bug in Linux, or a bug in the BIOS?**

Please help!
Ran

@esyr-rh
Contributor

esyr-rh commented Dec 25, 2019

https://forums.intel.com/s/question/0D70P000006Y0NU/cant-access-ram-from-pcie-when-using-xeon?s1oid=00DU0000000YT3c&s1nid=0DB0P000000U1Hq&emkind=chatterCommentNotification&s1uid=0050P000008KoEs&emtm=1575043703545&fromEmail=1&s1ext=0

Here, the microcode revision is stated to be 0x714; has it stayed the same, or has it been updated since?

  1. Trying to upgrade the kernel from 4.x to the latest 5.4 fails/hangs/freezes during boot without any errors.

Did the 4.x kernel include the MDS mitigations?

The issue may be considered similar only if the microcode is at revision 0x718 and the kernel contains the MDS mitigations (VERW instructions), as that is (presumably) the set of conditions that triggers hangs on 0x206d7 CPUs. Also note that the hangs reported in this bug are triggered randomly under load (of undetermined nature), not during boot.
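Whether a given kernel carries the MDS mitigations can be read from sysfs on patched kernels; a hedged check (the vulnerabilities directory only exists on kernels that include the mitigation framework):

```shell
# "Mitigation: Clear CPU buffers" means the VERW-based MDS mitigation is
# active; an absent file means the kernel predates the MDS patches.
cat /sys/devices/system/cpu/vulnerabilities/mds 2>/dev/null \
  || echo 'no mds entry: kernel predates the MDS mitigation patches'
```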

@esyr-rh
Contributor

esyr-rh commented May 21, 2020

Hello, the microcode-20200520 release includes updated SNB-EP microcode files (revision 0x621 for 06-2d-06 and 0x71a for 06-2d-07) that might resolve this issue; it may be worth trying them out.

@bulhoes
Author

bulhoes commented May 29, 2020

Hello,

I will give it a try as soon as I can.

I will provide feedback here on the results I observe.

Thank you.

@hmh

hmh commented Jun 9, 2020

@bulhoes, have you managed to test your SNB-EP? It would be nice to get the word out that people can update their SNB-EP servers if they are holding out because of this issue...

Thanks!

@bulhoes
Author

bulhoes commented Jun 12, 2020

Hello @hmh,

I'm looking right now to see if I can update one of the servers and leave it under load over the weekend.
Unfortunately, OVM still doesn't have a package with the new microcode. I'm trying to find a way to install it manually, just to ensure the microcode is correct and working fine.

I will update with news if I have any more info.

Regards,

@bulhoes
Author

bulhoes commented Jun 24, 2020

Hello,

I'm successfully running a couple of servers with the microcode release 0x71a without any issues.
I would like to ask if someone else would be able to do some testing, so we can close this issue.

Thank you.

Regards.

@whpenner

whpenner commented Aug 14, 2020

Since microcode update 0x71a addresses the issue (same for 0x621 for 06-2d-06) and no further reports have been received, this issue is ready to be closed.

@bulhoes
Author

bulhoes commented Aug 26, 2020

I haven't seen any more issues since the new microcode update, so I'm happy to close this issue.

Thank you for your support.

@bulhoes bulhoes closed this as completed Aug 26, 2020