COS-2747: blocked-edges/4.12.*: Declare AMD19hFirmware #5114

Merged 1 commit on Apr 16, 2024

wking (Member) commented Apr 16, 2024

The h suffix in 19h stands for "hexadecimal", so 19h is decimal 25. The first group by matches CPUs exposed to the firmware bug with a value of one. The second group by shows that the node_cpu_info metric is working on at least some nodes.
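
For anyone who wants to double-check the hex-to-decimal step, here is a minimal sketch. `family_to_decimal` is a hypothetical helper name, not something in this repository, and it assumes the family is written with a trailing `h` as in AMD's naming.

```shell
# Convert an AMD-style family name such as "19h" to decimal.
# family_to_decimal is a hypothetical helper, used here only for illustration.
family_to_decimal() {
  hex="${1%h}"              # strip the trailing "h" suffix
  printf '%d\n' "0x${hex}"  # printf accepts C-style hex constants
}

family_to_decimal 19h
```

The same helper maps other families too, for example `17h` to 23.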

I'm preserving vendor and family in the groupings, because it's nice to be able to say "we aren't matching this risk for your cluster, because we see..." and then point out an Intel CPU, or an AMD CPU with a different family, or some other example of why we don't think the risk applies to that cluster. The topk ensures that if we have both risk-matching CPUs and non-matching CPUs, we prefer the matching result (value 1) over the non-matching result(s) (value 0).
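
As a rough illustration of the selection described above (not the actual PromQL), the preference for the matching result can be mimicked on plain text rows of the form `value vendor family`. `pick_top` is a hypothetical name, and the Intel row is made-up sample data.

```shell
# Toy stand-in for topk(1, ...): keep the single row with the highest value.
# Rows are "value vendor family"; the input below is sample data only.
pick_top() {
  sort -k1,1nr | head -n 1
}

printf '%s\n' '0 GenuineIntel 6' '1 AuthenticAMD 25' | pick_top
```

When no row carries value 1, the surviving row is one of the value-0 rows, which is what lets us report a concrete vendor and family as the reason the risk does not match.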

The _id="" portion is a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stored data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, #3591).

Generated by writing the 4.12.45 risk by hand and copying it out to other impacted patch versions with:

$ for Z in $(seq 46 51); do sed "s/4.12.45/4.12.${Z}/" blocked-edges/4.12.45-AMD19hFirmware.yaml > "blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"; done

And then I set fixedIn: 4.12.53 in the 4.12.51 risk (we never had a 4.12.52 release, so 4.12.53 is the next one after 4.12.51).
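
The copy-then-patch steps above can be tried safely in a scratch directory. This is a sketch under assumptions: the YAML body is a two-line placeholder rather than the real risk declaration, and `fixedIn` is simply appended as a top-level key. The dots in the sed pattern are escaped here so they only match literal dots, which is slightly stricter than the command I actually ran.

```shell
# Rehearse the copy step in a scratch directory so nothing real is touched.
# The YAML content is a placeholder, not the real AMD19hFirmware risk.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/blocked-edges"
printf 'to: 4.12.45\nname: AMD19hFirmware\n' \
  > "${workdir}/blocked-edges/4.12.45-AMD19hFirmware.yaml"

for Z in $(seq 46 51); do
  # Escaped dots match only a literal "." in the version string.
  sed "s/4\.12\.45/4.12.${Z}/" \
    "${workdir}/blocked-edges/4.12.45-AMD19hFirmware.yaml" \
    > "${workdir}/blocked-edges/4.12.${Z}-AMD19hFirmware.yaml"
done

# Only the last impacted patch release records where the fix landed.
printf 'fixedIn: 4.12.53\n' \
  >> "${workdir}/blocked-edges/4.12.51-AMD19hFirmware.yaml"

ls "${workdir}/blocked-edges"
```

That leaves seven files, 4.12.45 through 4.12.51, with `fixedIn` present only in the 4.12.51 one.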

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Apr 16, 2024

openshift-ci-robot commented Apr 16, 2024

@wking: This pull request references COS-2747 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Apr 16, 2024

wking (Member, Author) commented Apr 16, 2024

Demos on a 4.12.55 -aws-sdn-serial CI run in PromeCIeus:

image

so that cluster might have been exposed?

count by (vendor, family) (node_cpu_info)

gives 24, with no other vendor/family options. From gathered artifacts for one node:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.12-e2e-aws-sdn-serial/1775993088406196224/artifacts/e2e-aws-sdn-serial/gather-extra/artifacts/nodes/ip-10-0-137-7.ec2.internal/journal | zgrep 'CPU[0-9]'
Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)
Apr 04 21:18:34.868558 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)

so, yes, looks like we expect that CPU to be exposed.
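
To tie the journal line back to the decimal family discussed in the description, here is a small sketch that pulls the family out of an smpboot line and prints it in decimal. `journal_family` is a hypothetical helper name, and it assumes the `(family: 0x..., model: ..., stepping: ...)` layout shown in the output above.

```shell
# Extract the hex CPU family from kernel smpboot lines on stdin and
# print the first one in decimal. journal_family is a hypothetical helper.
journal_family() {
  hex="$(sed -n 's/.*(family: \(0x[0-9a-fA-F]*\),.*/\1/p' | head -n 1)"
  printf '%d\n' "${hex}"
}

echo 'Apr 04 21:11:26.858724 localhost kernel: smpboot: CPU0: AMD EPYC 7R13 Processor (family: 0x19, model: 0x1, stepping: 0x1)' \
  | journal_family
```

For the EPYC 7R13 line above this yields 25, which is the family this risk is concerned with.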

sdodson (Member) commented Apr 16, 2024

/lgtm

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) on Apr 16, 2024

openshift-ci bot (Contributor) commented Apr 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking

wking (Member, Author) commented Apr 16, 2024

And here's the gcp-ovn-upgrade CI run for that 4.12 release, showing it running at least two CPUs that aren't exposed:

image

openshift-merge-bot merged commit e9d2bc1 into openshift:master on Apr 16, 2024 (4 of 5 checks passed)
wking deleted the AMD19hFirmware branch on April 16, 2024 18:30