COS-2747: blocked-edges/4.12.*: Declare AMD19hFirmware #5114

wking · 2024-04-16T15:58:40Z

The h suffix in 19h stands for "hexadecimal", so 19h is decimal 25. The first group by matches CPUs exposed to the firmware bug with a value of one. The second group by shows that the node_cpu_info metric is working on at least some nodes.
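As a quick sanity check on the hexadecimal reading, shell `printf` can convert in both directions. This is only an illustrative sketch; the variable names are made up here and nothing in it comes from the risk's PromQL.

```shell
# Convert the AMD family number between hexadecimal and decimal.
family_hex=0x19
family_dec=$(printf '%d' "$family_hex")   # printf accepts C-style 0x literals
echo "family ${family_hex} is decimal ${family_dec}"

# And back again, to confirm the round trip.
printf 'decimal %d is hex %xh\n' "$family_dec" "$family_dec"
```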

I'm preserving vendor and family in the groupings, because it's nice to be able to say "we aren't matching this risk for your cluster, because we see..." and then point out an Intel CPU, or an AMD CPU with a different family, or some other example of why we don't think the risk applies to that cluster. The topk ensures that if we have both risk-matching CPUs and non-matching CPUs, we prefer the matching result (value 1) over the non-matching result(s) (value 0).
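The preference can be pictured outside PromQL. The sketch below is only a shell analogy for keeping the single highest-valued result (assuming a top-1 selection), with made-up sample rows of vendor, family, and value; it is not the query from the risk file.

```shell
# Hypothetical per-group results: vendor, family, and 1 (matches) or 0 (does not).
results='GenuineIntel 6 0
AuthenticAMD 25 1
AuthenticAMD 23 0'

# Sort numerically on the value column, highest first, and keep one row,
# which is roughly what selecting the top result by value does.
best=$(printf '%s\n' "$results" | sort -k3,3nr | head -n 1)
echo "$best"
```

With these rows the matching AMD family 25 group wins over both non-matching groups, while a cluster with only non-matching CPUs would still return one of its value-0 rows to point at.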

The _id="" portion is a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, #3591).

Generated by writing the 4.12.45 risk by hand and copying it out to other impacted patch versions with:

$ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set fixedIn: 4.12.53 in the 4.12.51 risk (we never had a 4.12.52 release, so 4.12.53 is the next one after 4.12.51).
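The copy step can be rehearsed safely in a scratch directory. The sketch below assumes a heavily trimmed stand-in for the 4.12.45 file (the real risk carries more fields) and assumes that fixedIn is recorded by appending a line; the author may well have edited the file by hand.

```shell
# Rehearse the copy loop in a throwaway directory with a stand-in risk file.
workdir=$(mktemp -d)
mkdir "$workdir/blocked-edges"
cd "$workdir" || exit 1

# Hypothetical, heavily trimmed file; the real risk carries more fields.
cat > blocked-edges/4.12.45-AMD19hFirmware.yaml <<'EOF'
to: 4.12.45
from: .*
name: AMD19hFirmware
EOF

# Same loop as above: stamp out one copy per later patch version.
for Z in $(seq 46 51); do
  sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml \
    > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"
done

# Assumed way of recording the fix; the author may have edited by hand.
echo 'fixedIn: 4.12.53' >> blocked-edges/4.12.51-AMD19hFirmware.yaml

ls blocked-edges
```

The dots in the sed pattern are regular-expression wildcards, so the pattern is slightly looser than a literal version string; that is harmless as long as nothing else in the file resembles 4.12.45.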

openshift-ci-robot · 2024-04-16T15:58:44Z

@wking: This pull request references COS-2747 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.



wking · 2024-04-16T17:47:39Z

Demos on a 4.12.55 -aws-sdn-serial CI run in PromeCIeus:

so that cluster might have been exposed?

count by (vendor, family) (node_cpu_info)

gives 24, with no other vendor/family options. From gathered artifacts for one node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1775993088406196224/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-137-7.ec2.internal/journal | zgrep 'CPU[0-9]'
Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)
Apr 04 21:18:34.868558 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)

so, yes, looks like we expect that CPU to be exposed.
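To line the journal output up with the metric, the hexadecimal family in the smpboot line can be extracted and converted. This is a minimal sketch, assuming the metric's family label is decimal as the description implies; the sed expression and variable names are illustrative only.

```shell
# One of the smpboot lines quoted above.
line='Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)'

# Pull out the hexadecimal family and convert it to decimal.
family_hex=$(printf '%s\n' "$line" | sed -n 's/.*(family: \(0x[0-9a-fA-F]*\),.*/\1/p')
family_dec=$(printf '%d' "$family_hex")

if [ "$family_dec" -eq 25 ]; then
  echo "family ${family_hex} (decimal ${family_dec}): matches 19h"
else
  echo "family ${family_hex} (decimal ${family_dec}): does not match 19h"
fi
```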

sdodson · 2024-04-16T17:53:32Z

/lgtm

openshift-ci · 2024-04-16T17:55:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking


Needs approval from an approver in each of these files:

~~OWNERS~~ [wking]


wking · 2024-04-16T17:57:28Z

And here's the gcp-ovn-upgrade CI run for that 4.12 release, showing it running at least two CPUs that aren't exposed:

openshift-ci-robot added the jira/valid-reference label Apr 16, 2024

openshift-ci bot requested review from LalatenduMohanty and PratikMahajan April 16, 2024 16:06

openshift-ci bot added the approved label Apr 16, 2024

wking force-pushed the AMD19hFirmware branch from 43a9da8 to 13f32fd Compare April 16, 2024 17:42

openshift-ci bot assigned sdodson Apr 16, 2024

openshift-ci bot added the lgtm label Apr 16, 2024

openshift-merge-bot bot merged commit e9d2bc1 into openshift:master Apr 16, 2024
4 of 5 checks passed

wking deleted the AMD19hFirmware branch April 16, 2024 18:30
