COS-2747: blocked-edges/4.12.*: Declare AMD19hFirmware #5114
@wking: This pull request references COS-2747 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
The h suffix in "19h" stands for "hexadecimal", so 19h is decimal 25. The first group-by matches CPUs exposed to the firmware bug with a value of one. The second group-by shows that the node_cpu_info metric is working on at least some nodes.

I'm preserving vendor and family in the groupings, because it's nice to be able to say "we aren't matching this risk for your cluster, because we see..." and then point out an Intel CPU, or an AMD CPU with a different family, or some other example of why we don't think the risk applies to that cluster. The topk ensures that if we have both risk-matching CPUs and non-matching CPUs, we prefer the matching result (value 1) over the non-matching result(s) (value 0).

The _id="" portion is a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591).

Generated by writing the 4.12.45 risk by hand and copying it out to other impacted patch versions with:

$ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set 'fixedIn: 4.12.53' in the 4.12.51 risk (we never had a 4.12.52 release, so 4.12.53 is the next one after 4.12.51).
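The copy-out step can be rehearsed in a throwaway directory before touching the real blocked-edges tree. This is only a sketch: the scratch directory and the two-line stand-in file are assumptions for illustration, not the actual risk declaration, and the escaped dots plus the g flag are an optional tightening of the sed expression rather than what the PR ran.

```shell
#!/bin/sh
# Rehearse the fan-out in a scratch directory (hypothetical layout).
workdir="$(mktemp -d)"
mkdir -p "${workdir}/blocked-edges"

# Stand-in content only; the real 4.12.45 risk is written by hand.
printf 'to: 4.12.45\nname: AMD19hFirmware\n' \
    > "${workdir}/blocked-edges/4.12.45-AMD19hFirmware.yaml"

for Z in $(seq 46 51); do
    # Escaped dots match literal periods; g rewrites every occurrence on a line.
    sed "s/4\.12\.45/4.12.${Z}/g" \
        "${workdir}/blocked-edges/4.12.45-AMD19hFirmware.yaml" \
        > "${workdir}/blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"
done

# Count the hand-written file plus the generated copies, and inspect the last copy.
generated="$(ls "${workdir}/blocked-edges" | wc -l | tr -d ' ')"
last_to="$(head -n 1 "${workdir}/blocked-edges/4.12.51-AMD19hFirmware.yaml")"
echo "files: ${generated}, last: ${last_to}"
```

The fixedIn value is still added by hand to the 4.12.51 copy afterwards, since only the last impacted patch version carries it.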
Demos on a 4.12.55 -aws-sdn-serial CI run in PromeCIeus: so that cluster might have been exposed?
gives 24, with no other vendor/family options. From gathered artifacts for one node:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1775993088406196224/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-137-7.ec2.internal/journal | zgrep 'CPU[0-9]'
Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)
Apr 04 21:18:34.868558 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)
so, yes, looks like we expect that CPU to be exposed.
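To tie the journal output back to the "19h is decimal 25" note in the description, the family field can be pulled out of an smpboot line and converted. A minimal sketch, assuming the log line format shown above; the variable names are hypothetical.

```shell
#!/bin/sh
# One smpboot line copied from the journal output above.
line='Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)'

# Extract the hexadecimal family value (for example 0x19).
family_hex="$(printf '%s\n' "$line" | sed -n 's/.*family: \(0x[0-9a-fA-F]*\).*/\1/p')"

# printf accepts C-style 0x constants for %d, so this prints the decimal form.
family_dec="$(printf '%d' "$family_hex")"

echo "family ${family_hex} is decimal ${family_dec}"
```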
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sdodson, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
And here's the gcp-ovn-upgrade CI run for that 4.12 release, showing it running at least two CPUs that aren't exposed:
Merged e9d2bc1 into openshift:master