Identify why Ampere altras are restarting and not booting properly #2894

Closed
sxa opened this issue Mar 14, 2022 · 37 comments

Comments

@sxa
Member

sxa commented Mar 14, 2022

This has happened multiple times recently. For some reason it's restarting itself and not coming back. We need to identify why it's rebooting (error condition, patching, or something else) and then see why it's not coming back (separate test: perhaps try rebooting it during an idle period and see if it comes back).

The current recovery process is to connect to the out-of-band console (details in the Equinix UI) and exit from the Shell> prompt.

@richardlau
Member

I thought the problematic one was ubuntu2004_docker-arm64-1?
Refs: #2820 (comment)
Refs: #2835 (comment)

@sxa sxa changed the title from "Identify why ubuntu2004_docker-arm64-2 is restarting and not booting properly" to "Identify why ubuntu2004_docker-arm64-1 is restarting and not booting properly" Mar 14, 2022
@sxa
Member Author

sxa commented Mar 14, 2022

Changed the title

@richardlau
Member

And today it looks like test-equinix-ubuntu2004_docker-arm64-2 is down 😞. Logged into the out-of-band console and it was on the UEFI CLI. Typed exit at the prompt and then selected GNU/Linux at the GRUB menu and the machine booted.

@richardlau richardlau changed the title from "Identify why ubuntu2004_docker-arm64-1 is restarting and not booting properly" to "Identify why Ampere altras are restarting and not booting properly" Apr 14, 2022
@richardlau
Member

Looks like test-equinix-ubuntu2004_docker-arm64-2 is down again. It was stuck on the UEFI CLI again -- I've exited it and it's booting.

@richardlau
Member

And again test-equinix-ubuntu2004_docker-arm64-2 had restarted and was stuck on the UEFI CLI.

@richardlau
Member

test-equinix-ubuntu2004_docker-arm64-2 had restarted again and was stuck on the UEFI CLI. Logged into the OOB console and exited the CLI.

@richardlau
Member

Noticed the containers on test-equinix-ubuntu2004_docker-arm64-2 are all down again. Logged into the OOB console and exited the UEFI CLI again.

@richardlau
Member

Containers on test-equinix-ubuntu2004_docker-arm64-2 are all offline again.

@richardlau
Member

(Is it too optimistic to hope the planned maintenance makes a difference? 🙂)

@sxa
Member Author

sxa commented May 17, 2022

(Is it too optimistic to hope the #2948 makes a difference? 🙂)

I suspect so ;-)

I brought it back online earlier today and will contact WorksOnArm regarding the failures.

It seems to be throwing a few of these before it dies, although it manages to recover from quite a lot of them too:

May 13 17:56:46 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23448.790563] "node" (999554) uses deprecated CP15 Barrier instruction at 0x11a4a9c
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526304] {73}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526311] {73}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526314] {73}[Hardware Error]: event severity: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526317] {73}[Hardware Error]:  Error 0, type: corrected
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526324] {73}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526326] {73}[Hardware Error]:   section length: 0x30
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526332] {73}[Hardware Error]:   00000000: 40000003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526336] {73}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:03:10 test-equinix-ubuntu2004-docker-arm64-2 kernel: [23832.526338] {73}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666503] {74}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666509] {74}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666512] {74}[Hardware Error]: event severity: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666515] {74}[Hardware Error]:  Error 0, type: corrected
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666522] {74}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666524] {74}[Hardware Error]:   section length: 0x30
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666531] {74}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666534] {74}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24140.666537] {74}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879202] {75}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879208] {75}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879211] {75}[Hardware Error]: event severity: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879214] {75}[Hardware Error]:  Error 0, type: corrected
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879221] {75}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879222] {75}[Hardware Error]:   section length: 0x30
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879229] {75}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879232] {75}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:13:31 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24453.879235] {75}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326137] {76}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326145] {76}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326147] {76}[Hardware Error]: event severity: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326150] {76}[Hardware Error]:  Error 0, type: corrected
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326157] {76}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326158] {76}[Hardware Error]:   section length: 0x30
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326166] {76}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326169] {76}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:16:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [24621.326172] {76}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754400] {77}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754406] {77}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754408] {77}[Hardware Error]: event severity: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754411] {77}[Hardware Error]:  Error 0, type: corrected
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754418] {77}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754419] {77}[Hardware Error]:   section length: 0x30
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754427] {77}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754430] {77}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:24:49 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25131.754433] {77}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069449] {78}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069456] {78}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069458] {78}[Hardware Error]: event severity: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069461] {78}[Hardware Error]:  Error 0, type: corrected
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069470] {78}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069471] {78}[Hardware Error]:   section length: 0x30
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069478] {78}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069481] {78}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:38:53 test-equinix-ubuntu2004-docker-arm64-2 kernel: [25976.069484] {78}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552450] {79}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552457] {79}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552460] {79}[Hardware Error]: event severity: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552463] {79}[Hardware Error]:  Error 0, type: corrected
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552471] {79}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552473] {79}[Hardware Error]:   section length: 0x30
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552480] {79}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552483] {79}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:42:51 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26213.552486] {79}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123337] {80}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123344] {80}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123346] {80}[Hardware Error]: event severity: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123349] {80}[Hardware Error]:  Error 0, type: corrected
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123356] {80}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123357] {80}[Hardware Error]:   section length: 0x30
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123364] {80}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123367] {80}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 18:46:50 test-equinix-ubuntu2004-docker-arm64-2 kernel: [26453.123370] {80}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802232] {81}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802239] {81}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802242] {81}[Hardware Error]: event severity: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802245] {81}[Hardware Error]:  Error 0, type: corrected
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802253] {81}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802254] {81}[Hardware Error]:   section length: 0x30
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802262] {81}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802265] {81}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:02:34 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27396.802267] {81}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949286] {82}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949293] {82}[Hardware Error]: It has been corrected by h/w and requires no further action
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949295] {82}[Hardware Error]: event severity: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949298] {82}[Hardware Error]:  Error 0, type: corrected
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949306] {82}[Hardware Error]:   section type: unknown, e8ed898d-df16-43cc-8ecc-54f060ef157f
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949307] {82}[Hardware Error]:   section length: 0x30
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949315] {82}[Hardware Error]:   00000000: 40010003 00000000 00400000 00462030  ...@......@.0 F.
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949318] {82}[Hardware Error]:   00000010: 00000001 0000e000 00000000 00000000  ................
May 13 19:08:18 test-equinix-ubuntu2004-docker-arm64-2 kernel: [27740.949321] {82}[Hardware Error]:   00000020: 00000000 00000003 00000000 00000000  ................
May 16 11:54:43 test-equinix-ubuntu2004-docker-arm64-2 kernel: [    0.000000] Booting Linux on physical CPU 0x0000120000 [0x413fd0c1]
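
For anyone who wants to summarise these entries rather than read them by eye, here is a rough Python sketch that groups the `[Hardware Error]` lines by their `{N}` sequence number and notes when the next boot was logged. It is only an illustration: the function names are made up, the year is assumed to be 2022 because syslog timestamps omit it, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# One "[Hardware Error]" line, e.g.
# May 13 18:03:10 host kernel: [23832.526314] {73}[Hardware Error]: event severity: corrected
HW_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+\{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<msg>.*)$"
)
# First kernel line of a fresh boot (uptime 0.000000).
BOOT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"\[\s*0\.000000\]\s+Booting Linux"
)


def parse_stamp(stamp, year=2022):
    """syslog timestamps carry no year, so one is assumed (2022 here)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarize(lines):
    """Return ({seq: event}, [boot times]) for an iterable of log lines."""
    events = {}
    boots = []
    for line in lines:
        line = line.rstrip("\n")
        hw = HW_ERROR.match(line)
        if hw:
            # All lines sharing a {N} prefix belong to one error record.
            event = events.setdefault(
                int(hw["seq"]),
                {"time": parse_stamp(hw["stamp"]),
                 "uptime": float(hw["uptime"]),
                 "severity": None},
            )
            if hw["msg"].startswith("event severity:"):
                event["severity"] = hw["msg"].split(":", 1)[1].strip()
            continue
        boot = BOOT.match(line)
        if boot:
            boots.append(parse_stamp(boot["stamp"]))
    return events, boots


def gap_before_boot(events, boots):
    """Time between the last logged hardware error and the first later boot."""
    if not events:
        return None
    last_error = max(event["time"] for event in events.values())
    later = [b for b in boots if b > last_error]
    return min(later) - last_error if later else None
```

Calling `summarize()` on the lines of a saved copy of the kernel log (the file location is up to you) gives one record per error plus the boot times. `gap_before_boot()` only bounds when the machine went down, since the gap also includes however long it sat at the UEFI prompt.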

@richardlau
Member

Both machines were offline over the weekend, stuck on the UEFI CLI #2959. I've logged into the OOB console on both and exited the CLI.

@sxa
Member Author

sxa commented Jun 16, 2022

It looks like one of them may not have been started after the previous maintenance window. For the other one (which has been unreliable for us), Equinix have provided me with a replacement. I'm provisioning it with Ubuntu 20.04 just now and it will come up as test-equinix-ubuntu2004-arm64-3, so we can migrate off the unstable one and leave it to them to analyse the fault.

@sxa sxa self-assigned this Jun 16, 2022
@richardlau
Member

The second one (-2) was offline again. I've gone into the OOB console and exited the UEFI prompt.

@richardlau
Member

Rescued the second Altra again this morning.

@sxa
Member Author

sxa commented Jun 20, 2022

Looks to be down again. Let's not bring it back. I've got the playbook running at the moment which will bring up the -3 machine with direct replacements (same names) for the containers on the defective -2 system.

(For anyone watching along, the firewall rules have been switched to replace -2 with -3 so there should be no risk of both machines connecting together)

@pgmwoa

pgmwoa commented Jun 29, 2022

@sxa, @richardlau, please delete the problematic Altra server (Mt Jade under WoA) that is not used, so that there is no confusion when the Equinix support team reclaims it. We need it deleted and freed for further investigation. Currently, all 3 Mt Jade servers are showing as provisioned and active.
@sxa Please confirm by replying to the email dated 27th Jun with subject "Node.js - Works On Arm Sponsored - Stability issue".
Thanks
WoA Program Team

@richardlau
Member

I've deleted the Altra that had IP address 139.178.85.13.

@sxa
Member Author

sxa commented Jun 30, 2022

Confirmed via email

@richardlau
Member

Looks like the first Altra restarted around 5 and a half hours ago and was stuck on the UEFI prompt. I've logged into the OOB console and exited.

@richardlau
Member

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
I saw this while the machine was booting (after the prompt was exited):

[    0.925839] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    1.011928] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.018605] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.025254] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.031897] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.039030] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.045686] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.052330] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.058972] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:

@richardlau
Member

Recovered test-equinix-ubuntu2004-docker-arm64-1 again from the UEFI prompt.
Same messages as before when booting:

[    0.892690] tpm_crb MSFT0101:00: [Firmware Bug]: ACPI region does not cover the entire command/response buffer. [mem 0x88500000-0x88500fff flags 0x201] vs 88500038 1000
[    0.980799] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.987482] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    0.994141] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.000805] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12
[    1.008286] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.014963] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.021617] cma: cma_alloc: alloc failed, req-size: 256 pages, ret: -12
[    1.028270] cma: cma_alloc: alloc failed, req-size: 128 pages, ret: -12

Ubuntu 20.04.4 LTS test-equinix-ubuntu2004-docker-arm64-1 ttyAMA0

test-equinix-ubuntu2004-docker-arm64-1 login:

@sxa
Member Author

sxa commented Jul 22, 2022

Most recent jobs before the crash seem to have been centos7-arm64-gcc6 ones, although they were listed as SUCCESS (this is from the Jenkins server log):

2022-07-16 06:08:53:620 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42769 Started by upstream project "node-test-commit-arm" build number 42,769, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-16T10:02:12Z completed in 392437ms completed: SUCCESS
2022-07-17 06:09:04:086 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42781 Started by upstream project "node-test-commit-arm" build number 42,781, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-17T10:02:14Z completed in 400434ms completed: SUCCESS
2022-07-18 06:11:07:881 - AuditLog - node-test-commit-arm » centos7-arm64-gcc6 #42808 Started by upstream project "node-test-commit-arm" build number 42,808, Parameters:[NODEJS_VERSION: {12.22.13}, NODEJS_MAJOR_VERSION: {12}] on node test-equinix-centos7_container-arm64-2 started at 2022-07-18T10:02:16Z completed in 523189ms completed: SUCCESS

NOTES:
The above is the output of running egrep with the pattern "test-equinix-centos7_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-2|test-equinix-ubuntu1804_sharedlibs_container-arm64-2|test-equinix-ubuntu2004_sharedlibs_container-arm64-1|test-equinix-ubuntu1804_container-arm64-1|test-equinix-centos8_container-arm64-1|test-equinix-rhel8_container-arm64-1|test-equinix-ubuntu2004_container-armv7l-1|test-equinix-centos7_container-arm64-1|test-equinix-ubuntu2004_sharedlibs_container-arm64-3|test-equinix-ubuntu1804_sharedlibs_container-arm64-1|test-equinix-ubuntu2004_container-arm64-1|test-equinix-debian10_container-armv7l-1|test-equinix-ubuntu1804_sharedlibs_container-arm64-3" against the Jenkins log, which shows every entry for the containers on that host.
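
If the egrep output needs sorting or filtering further, the same idea can be sketched in Python. This is only an illustration under assumptions: the regular expression is inferred from the three AuditLog lines quoted above and may not cover other entry shapes, `NODES` is just a subset of the container names in the pattern, and `builds_on` is a made-up helper name.

```python
import re

# A few of the container agents from the egrep pattern; extend as needed.
NODES = [
    "test-equinix-centos7_container-arm64-2",
    "test-equinix-ubuntu2004_sharedlibs_container-arm64-2",
    "test-equinix-ubuntu2004_container-armv7l-1",
]

# Shape inferred from the AuditLog lines quoted in this comment.
AUDIT = re.compile(
    r"^(?P<logged>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d:\d{3}) - AuditLog - "
    r"(?P<job>.+?) #(?P<build>\d+) .* on node (?P<node>\S+) "
    r"started at (?P<started>\S+) completed in (?P<ms>\d+)ms "
    r"completed: (?P<result>\w+)$"
)


def builds_on(lines, nodes=NODES):
    """Yield one dict per audit entry that ran on any of the given nodes."""
    wanted = set(nodes)
    for line in lines:
        entry = AUDIT.match(line.rstrip("\n"))
        if entry and entry["node"] in wanted:
            yield {
                "logged": entry["logged"],
                "job": entry["job"],
                "build": int(entry["build"]),
                "node": entry["node"],
                "started": entry["started"],
                "seconds": int(entry["ms"]) / 1000,
                "result": entry["result"],
            }
```

Passing an open log file to `builds_on()` and taking the last few results gives the jobs that finished on the host shortly before it went down. Jobs that were still running when the machine died will not have a "completed" entry, so they would not show up here.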

In case there are any issues specific to centos7-arm64-gcc6, I'm going to run a few rebuilds of https://ci.nodejs.org/job/node-test-commit-arm/42880 which is ONLY building that one.

@sxa
Member Author

sxa commented Jul 28, 2022

Have taken the second centos7 container offline and am currently running the centos7 gcc6 job repeatedly on the "failing" Altra. I will also add the ubuntu2004-armv7l combination in future runs, as that is potentially more suspect than the others, and will take test-equinix-centos7_container-arm64-2 from the other machine offline for now too.

Running as builds https://ci.nodejs.org/job/node-test-commit-arm 42988 up to 43000 which is running:

And builds https://ci.nodejs.org/job/node-test-commit-arm 43001 up to 43010 which is running:

@joyeecheung
Member

It seems the issue is happening again (#3022); it has been blocking the CI for a while.

@sxa
Member Author

sxa commented Aug 30, 2022

I've brought https://ci.nodejs.org/computer/test-equinix-ubuntu2004_container-armv7l-2/ back online to clear the backlog.

test-equinix-ubuntu2004-arm64-1 - 145.40.81.219 - had gone offline, so we'll need to re-evaluate what's going on here. That's the first outage we've had in a few weeks on that server. It's now back, so there are two executors for the ubuntu2004-armv7l jobs available again.

@richardlau
Member

Had to log into the OOB console for test-equinix-ubuntu2004-arm64-1 today to exit the UEFI prompt.

@richardlau
Member

Had to recover test-equinix-ubuntu2004-arm64-1 today in the usual way.

@richardlau
Member

test-equinix-ubuntu2004-arm64-1 had rebooted/was stuck again today 😞. I've recovered it.

@richardlau
Member

Have taken the second centos7 container offline

@sxa FYI I've brought back the second container to help process the job queue.

@richardlau
Member

test-equinix-ubuntu2004-arm64-1 was stuck again and has now been recovered.

@richardlau
Member

Looks like all the containers on test-equinix-ubuntu2004-arm64-1 are offline again. I'm not sure for how long as there's no build history for any of them (we delete old build history, but I forget how far back the cut off is).

I'm in a meeting now, but I'll look at the host after it -- I suspect the host is stuck on the UEFI boot prompt again..

@richardlau
Member

Looks like all the containers on test-equinix-ubuntu2004-arm64-1 are offline again.
...
I suspect the host is stuck on the UEFI boot prompt again..

It was. I've logged into the out of band console and exited the UEFI prompt. Host is back online and the containers are processing jobs.

github-actions bot commented Dec 1, 2023

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Dec 1, 2023
@github-actions github-actions bot closed this as not planned Jan 1, 2024
@sxa sxa reopened this Jan 3, 2024
@targos
Member

targos commented Jan 3, 2024

@sxa I think this was fixed in the context of #3492

@sxa
Member Author

sxa commented Jan 3, 2024

Interesting - I thought we had that applied previously on the machines - @richardlau how confident are you that we're ok with this on all the systems now? We had two issues - the fact it was falling over on its own and the fact that it didn't come back up (which sounds like it's what's resolved on -3).

@richardlau
Member

Re. "didn't come back up" we had two issues:

  1. Machine rebooted into UEFI prompt. No idea what was causing this, but I don't believe we've hit this for a while. (If #2894 (comment) was the last case then almost a year.)
  2. Machine rebooted into grub prompt. This is fixed by applying https://gist.github.com/vielmetti/dafb5128ef7535c218f6d963c5bc624e#prevention-of-boot-failures which I believe has been done to both machines.

I don't think we ever worked out why the machines restarted themselves in the first place.

@github-actions github-actions bot removed the stale label Jan 4, 2024
@sxa
Member Author

sxa commented Jan 4, 2024

Hmmm ok, if it's been about a year since we last had an unexplained reboot then I think I'm ok with closing this and we can re-open if required. Hadn't realised it had been so long :-)

@sxa sxa closed this as completed Jan 4, 2024