
Conversation


@jaypoulz jaypoulz commented Oct 21, 2025

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Not gated because it's only used by CEO when two-node has transitioned.

Works in conjunction with openshift/cluster-etcd-operator#1487

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 21, 2025

openshift-ci-robot commented Oct 21, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Oct 21, 2025

Hello @jaypoulz! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 21, 2025
@openshift-ci openshift-ci bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Oct 21, 2025

openshift-ci-robot commented Oct 21, 2025

@jaypoulz: This pull request references OCPEDGE-2084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Introduces tnf.etcd.openshift.io/v1alpha1 API group with PacemakerStatus custom resource. This provides visibility into Pacemaker cluster health for dual-replica etcd deployments. The status-only resource is populated by a privileged controller and consumed by the cluster-etcd-operator healthcheck controller. Gated by DualReplica feature and managed by two-node-fencing component.

Works in conjunction with openshift/cluster-etcd-operator#1487


@openshift-ci openshift-ci bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Oct 21, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 4 times, most recently from 2ba442d to 29b9fec Compare October 21, 2025 23:56
@saschagrunert
Member

@jaypoulz thank you for the PR, do you mind making the CI happy?

@jaypoulz
Author

Hi @saschagrunert :) Working on it! :D
New to this repo so working through beginner challenges 😸

@jaypoulz
Author

A few open questions I have:

  1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our HA tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs a cronjob to post an update to it.

  2. bash hack/update-protobuf.sh failed for me because it expected the repo to be under my GOPATH. That said, Cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

  3. For the non-boolean enum fields: should I be creating static string definitions that can be exported to CEO? How do I generate those?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 2 times, most recently from b0ff230 to 1b57b09 Compare October 22, 2025 16:59
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 22, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 4 times, most recently from b9b727f to fdd53e9 Compare October 22, 2025 20:37

saschagrunert commented Oct 23, 2025

Yeah, I'll ignore the CI failures for now, running ./hack/update-codegen.sh locally also gives me a diff in openapi/generated_openapi/zz_generated.openapi.go. 🙃

A few open questions I have:

  1. This is a config object of a sort. It's created by cluster-etcd-operator only when you have a two-node cluster and only for the purposes of gathering information about the health of pacemaker (our HA tool) from the nodes. I put it in etcd/tnf (two node fencing) because it seemed sensible. But I'm not sure if it needs to be in config.

I'm new to API review, but my gut feeling tells me that a dedicated etcd API group sounds fine for that purpose.

That said, it doesn't work like a normal config - there's no spec and it shouldn't be created during bootstrap. The CRD just needs to be present when the CEO runs a cronjob to post an update to it.

  2. bash hack/update-protobuf.sh failed for me because it expected the repo to be under my GOPATH. That said, Cursor happily runs it and copies over the files without issue. I'm just skeptical of the zz_generated files, but I assume those are verified by CI?

You can also try to run it in a container via make verify-with-container.

  3. For the non-boolean enum fields: should I be creating static string definitions that can be exported to CEO? How do I generate those?

Do you mind elaborating on that? Do you mean generating the code for the unions?

API docs ref: https://github.com/openshift/enhancements/blob/master/dev-guide/api-conventions.md#writing-a-union-in-go


@jaypoulz is there an OpenShift enhancement available for this change?

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 3 times, most recently from 3f45017 to 2fb0282 Compare October 24, 2025 21:15
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2025
@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 5 times, most recently from 8979f47 to 6ca958d Compare October 28, 2025 00:42
@jaypoulz
Author

@saschagrunert I think I hit all of your comments. I've also asked pacemaker expert CLumens from the RHEL team to make sure I wasn't misrepresenting anything in the new spec.

@saschagrunert
Member

/retest

Comment on lines 289 to 302
// ipv4Address is the IPv4 address of the node, if registered via IPv4
// +kubebuilder:validation:MinLength=7
// +kubebuilder:validation:MaxLength=15
// +kubebuilder:validation:Pattern="^((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
// +optional
IPv4Address string `json:"ipv4Address,omitempty"`

// ipv6Address is the IPv6 address of the node, if registered via IPv6
// +kubebuilder:validation:MinLength=2
// +kubebuilder:validation:MaxLength=39
// +kubebuilder:validation:Format=ipv6
// +kubebuilder:validation:Pattern=`^(([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$`
// +optional
IPv6Address string `json:"ipv6Address,omitempty"`

@saschagrunert saschagrunert Oct 28, 2025


CEL has IP validations that support both IPv4 and IPv6. It would be better to combine them and use the CEL validations instead, sorry for the back and forth here:

https://github.com/kubernetes/kubernetes/blob/f0ed028e753f97f8b74044c75b8d746e1dce00c6/staging/src/k8s.io/apiserver/pkg/cel/library/ip.go#L30-L125

Author


no no I appreciate this :D
I can see the API getting better with each revision 🥂

Author


Based on what I saw, "canonical" seems to be the way to test for a valid IPv4 or IPv6 address.
I've added validation based on what I saw elsewhere in the API.

Author


Actually no, this needs further work. Canonical is a useful check, but it doesn't guarantee that you have a usable individual IP. Adding more checks.

Author


@saschagrunert so it turns out the version of the schema checker is too old to support ip(self).isCanonical().
I've added parsing for the IP in the code that invokes the API, so I think it's overkill to update the schema checker just for this, but I wanted to explain why it's no longer in the diff.
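That client-side parsing can be sketched with Go's net/netip, where the canonical-form check falls out of a round-trip comparison (the function name and exact policy here are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateNodeIP parses an address and checks that it round-trips to the
// same string, i.e. that it was supplied in canonical form.
func validateNodeIP(s string) (netip.Addr, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return netip.Addr{}, fmt.Errorf("invalid IP %q: %w", s, err)
	}
	if addr.String() != s {
		return netip.Addr{}, fmt.Errorf("IP %q is not in canonical form (want %q)", s, addr)
	}
	return addr, nil
}

func main() {
	// "2001:0db8::1" parses, but its canonical form is "2001:db8::1",
	// so the round-trip check rejects it.
	for _, s := range []string{"192.168.1.10", "2001:db8::1", "2001:0db8::1"} {
		if _, err := validateNodeIP(s); err != nil {
			fmt.Println("reject:", err)
		} else {
			fmt.Println("accept:", s)
		}
	}
}
```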

Author


In terms of why isIP() is not sufficient:
It depends how strict we want to be in this API. The IPs we use are expected to identify the nodes as their endpoint-identities for etcd. So they should be unique, they should (ideally) be in their canonical form, and they should be the kinds of IPs that are not reserved for special cases.

I defaulted to stricter validation because I've never written one of these, so I decided to err on the side of adding the restrictions that made sense to me.

Author


I'll add it back and I'll defer to your guidance on how to proceed :)

Contributor


Just double checking the docs https://github.com/kubernetes/kubernetes/blob/3daf280c464c712f38fe2a24d9434fcf2670c251/staging/src/k8s.io/apiserver/pkg/cel/library/ip.go#L76

Looks like ip.isCanonical(self) might be the right incantation

Author


Strange o.O
I'll try it 😺

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 3 times, most recently from 3e02535 to e6b5c99 Compare October 28, 2025 17:20

// nodeHistory provides recent operation history for troubleshooting
// When present, it must be a list of 1 or more PacemakerNodeHistoryEntry objects.
// When not present, the node history is not available. This is the expected status for a healthy cluster.


I wouldn't think that an empty node history is expected status. Assuming that this is basically <node_history/> from crm_mon, you should have a tree with at least start and monitor operations for every resource on each node. If there's no history, I would assume there's no running resources.

Author

@jaypoulz jaypoulz Oct 28, 2025


The reason we allow this to be empty is that we only push up "recent" information. Basically, we are trying to collect the information from pacemaker that would indicate that we've gone off the rails. So before this API is invoked, we gather all of the information, then filter out any node history event that isn't within the last 5 minutes.

For fencing history, we carry failures longer: a 24-hour context window.

So it's not a full 1-1 mapping. :) I'll make a note of these in the API.
This information is used for event records only. I don't think we need to be exhaustive about all events; just a warning that something happened within the last n minutes or hours is all that's needed for the event record.

Author


Also, this check runs every 30 seconds, and events get reported exactly once (deduplication is done on the client side).

// nodeHistory provides recent operation history for troubleshooting
// When present, it must be a list of 1 or more PacemakerNodeHistoryEntry objects.
// When not present, the node history is not available. This is the expected status for a healthy cluster.
// Node history being capped at 16 is a reasonable limit to prevent abuse of the API, since the action history reported by the cluster


Depending on the number of resources you've got running, 16 may be too low. On my test cluster, each resource has two history entries just from starting up.

Author


We have 6 (2 kubelet, 2 etcd, 2 fencing agents).
I can bump it to 32, but I'd have the same concern either way, given that we only show node history for the last 5 minutes.

Author

@jaypoulz jaypoulz Oct 28, 2025


More specifically:
(pre-API) Events reported = events that occurred in the last 5 minutes, running every 30s
(post-API) Events presented to user = events that occurred in the last 5 minutes minus events already reported, running every 30s

// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=16
// +optional
ResourcesTotal *int32 `json:"resourcesTotal,omitempty"`


Do you care about maintenance mode or Pacemaker Remote nodes?

Author


TNF doesn't use either of these. If we end up needing to introduce maintenance mode for whatever reason, some extensions to the API would be needed. Likewise, I don't see us ever supporting remote nodes.

That said, is there a specific reason you highlighted this concern for resourcesTotal? Or was it a general question about why we don't check for this when we gather node info?

// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=256
// +optional
Node string `json:"node,omitempty"`


Do you support clone resources? If so, those can run on multiple nodes at the same time in which case making this some sort of list type would make more sense to me. Also if you care about clones, keep in mind that the name of the primitive resource being cloned is not unique.

Author


We do support clone resources. Both etcd and kubelet run as clone resources. This is why the expected number of resources is 6 (clones for etcd and kubelet, plus unique fencing agents for both nodes).

Currently, when we build out the error message, we go through them all individually. Grouping them is an interesting idea. It could improve visual clarity, but it seems like something we can do during rendering. Treating each resource as unique feels simpler.
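A rendering-time grouping of clone instances could be sketched like this (the types and names are hypothetical, not from the PR):

```go
package main

import (
	"fmt"
	"sort"
)

// resourceStatus is a simplified stand-in for a per-resource status entry.
// For clones, Name (the primitive's name) is not unique across entries.
type resourceStatus struct {
	Name string
	Node string
}

// groupByName collapses clone instances into one entry per resource name,
// listing every node the resource is active on. The API keeps one entry per
// instance; this grouping happens only when rendering the error message.
func groupByName(resources []resourceStatus) map[string][]string {
	grouped := make(map[string][]string)
	for _, r := range resources {
		grouped[r.Name] = append(grouped[r.Name], r.Node)
	}
	for _, nodes := range grouped {
		sort.Strings(nodes) // deterministic output for rendering
	}
	return grouped
}

func main() {
	rs := []resourceStatus{
		{Name: "etcd", Node: "node-0"},
		{Name: "etcd", Node: "node-1"},
		{Name: "kubelet", Node: "node-0"},
	}
	for name, nodes := range groupByName(rs) {
		fmt.Println(name, nodes)
	}
}
```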

@jaypoulz jaypoulz force-pushed the OCPEDGE-2084 branch 3 times, most recently from d29f516 to cf53006 Compare October 28, 2025 23:11

@saschagrunert saschagrunert left a comment


LGTM from an API Shadow review perspective.


openshift-ci bot commented Oct 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: saschagrunert
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saschagrunert
Member

/retest


// PacemakerDaemonStateType represents the state of the pacemaker daemon
// +kubebuilder:validation:Enum=Running;KnownNotRunning
type PacemakerDaemonStateType string
Member


We may need to add docs about the possible values here as well. If so, then the same would apply to QuorumStatusType, NodeOnlineStatusType, NodeModeType, ResourceRoleType, ResourceActiveStatusType, FencingActionType and FencingStatusType

Author


I'll add them just for completeness :)
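Documenting the possible values usually means doc comments on exported constants for each enum member; a sketch for one of the types (the constant names and wording below are illustrative, not the PR's final text):

```go
package main

import "fmt"

// PacemakerDaemonStateType represents the observed state of the pacemaker daemon.
// +kubebuilder:validation:Enum=Running;KnownNotRunning
type PacemakerDaemonStateType string

const (
	// DaemonStateRunning means the pacemaker daemon responded and reported
	// itself as active on the node.
	DaemonStateRunning PacemakerDaemonStateType = "Running"
	// DaemonStateKnownNotRunning means the daemon was queried and confirmed
	// not running, as opposed to simply being unreachable.
	DaemonStateKnownNotRunning PacemakerDaemonStateType = "KnownNotRunning"
)

func main() {
	fmt.Println(DaemonStateRunning, DaemonStateKnownNotRunning)
}
```

The same pattern would repeat for QuorumStatusType, NodeOnlineStatusType, and the other enum types named above.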

@JoelSpeed
Contributor

Since @saschagrunert has said this is good from his side, I'll now take over the API review. Since it's shift week, I'm not expecting to pick this up until Monday

@jaypoulz
Author

Sounds good to me! :)

Introduces etcd.openshift.io/v1alpha1 API group with a PacemakerCluster
custom resource. This provides visibility into Pacemaker cluster health for
Two Node Fencing (TNF) etcd deployments. The status-only resource is populated by a
privileged controller and consumed by the cluster-etcd-operator healthcheck
controller. This API is not explicitly gated because it's only created by CEO
once the transition to an ExternalEtcd has occurred. This means that it is
naturally gated by the TNF topology.

openshift-ci bot commented Oct 29, 2025

@jaypoulz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn df97bb6 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
