OCPEDGE-2254: add openshift-install tnf validate-fencing subcommand #10546
Neilhamza wants to merge 6 commits
Add a new subcommand that validates fencing on Two Node with Fencing (DualReplica) clusters by fencing both nodes sequentially and verifying recovery. The command runs pre-flight checks (STONITH, Pacemaker, etcd), then fences each node from its peer and validates full recovery. This is a disruptive, manual command intended to be run post-install, post-upgrade, or in CI/CD pipelines. OCPEDGE-2254 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza: This pull request references OCPEDGE-2254 which is a valid jira issue.
Walkthrough
Adds a top-level
Changes
Cluster Fencing Validation
sequenceDiagram
actor User
participant CLI as rgba(70,130,180,0.5) CLI
participant KubeAPI as rgba(34,139,34,0.5) Kubernetes API
participant ConfigAPI as rgba(255,165,0,0.5) OpenShift Config API
participant NodeA as rgba(123,104,238,0.5) Node A
participant NodeB as rgba(220,20,60,0.5) Node B
User->>CLI: run tnf validate-fencing
CLI->>CLI: load kubeconfig, init clients
CLI->>KubeAPI: list control-plane nodes
KubeAPI-->>CLI: NodeA, NodeB
CLI->>NodeA: SSH connect / pre-flight checks
CLI->>NodeB: SSH connect / pre-flight checks
CLI->>NodeA: pcs stonith fence NodeB
CLI->>KubeAPI: wait NodeB NotReady → Ready
CLI->>NodeA: poll Pacemaker for NodeB online
CLI->>KubeAPI: poll etcd quorum restoration
CLI->>NodeB: pcs stonith fence NodeA
CLI->>KubeAPI: wait NodeA NotReady → Ready
CLI->>NodeB: poll Pacemaker for NodeA online
CLI->>KubeAPI: poll etcd quorum restoration
CLI-->>User: success / failure
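The ordering in the diagram above can be sketched in Go as a loop over (peer, target) pairs. This is only an illustration of the sequencing, not the PR's implementation: the `cluster` interface, its method names, and `fakeCluster` are hypothetical and do not appear in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a hypothetical abstraction over the SSH and Kubernetes calls
// shown in the diagram; none of these method names come from the PR.
type cluster interface {
	// Fence runs the fence command on node `from` against node `target`.
	Fence(from, target string) error
	// WaitRecovered blocks until target is Ready, online in Pacemaker,
	// and etcd quorum is restored (all polled via the surviving peer).
	WaitRecovered(peer, target string) error
}

// validateFencing fences each node from its peer, one at a time, and stops
// at the first failure so the second node is never fenced while the first
// is still recovering.
func validateFencing(c cluster, nodeA, nodeB string) error {
	pairs := [][2]string{{nodeA, nodeB}, {nodeB, nodeA}} // {peer, target}
	for _, p := range pairs {
		peer, target := p[0], p[1]
		if err := c.Fence(peer, target); err != nil {
			return fmt.Errorf("fencing %s from %s: %w", target, peer, err)
		}
		if err := c.WaitRecovered(peer, target); err != nil {
			return fmt.Errorf("waiting for %s to recover: %w", target, err)
		}
	}
	return nil
}

// fakeCluster records calls so the ordering can be inspected without a
// real cluster.
type fakeCluster struct {
	calls  []string
	failOn string // target whose recovery should fail, if any
}

func (f *fakeCluster) Fence(from, target string) error {
	f.calls = append(f.calls, "fence "+target+" from "+from)
	return nil
}

func (f *fakeCluster) WaitRecovered(peer, target string) error {
	f.calls = append(f.calls, "recover "+target)
	if target == f.failOn {
		return errors.New("timed out")
	}
	return nil
}

func main() {
	f := &fakeCluster{}
	fmt.Println(validateFencing(f, "node-a", "node-b"), f.calls)
}
```

Returning on the first error matters in a two-node cluster: fencing the second node before the first has rejoined would take down both members at once.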
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ✅ 11 passed | ❌ 1 failed (1 warning)
[APPROVALNOTIFIER] This PR is NOT APPROVED
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/fencing/validate.go`:
- Around line 40-41: The regexp daemonActiveRe currently uses alternation
incorrectly and matches "enabled" anywhere; update the pattern used in the
regexp.MustCompile call for daemonActiveRe so the alternation applies to the
word after "active" (e.g., group the alternatives so it matches "active"
followed by either "running" or "enabled" in a case-insensitive way), then run
tests/linters to confirm behavior.
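The precedence problem described in this comment can be reproduced with a small Go program. Both patterns below are assumptions made for illustration; the PR's actual `daemonActiveRe` pattern is not shown in this conversation, and the sample status lines are likewise made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are assumptions for illustration only; they are not copied
// from pkg/fencing/validate.go.
var (
	// Alternation binds loosest, so this reads as
	// ("active/running") OR ("enabled" anywhere in the line).
	ungrouped = regexp.MustCompile(`(?i)active/running|enabled`)
	// Grouping scopes the alternation to the word after "active/".
	// \b keeps "inactive/..." from matching on its "active" suffix.
	grouped = regexp.MustCompile(`(?i)\bactive/(running|enabled)`)
)

func main() {
	for _, line := range []string{
		"corosync: active/enabled",
		"pacemaker: active/running",
		"pcsd: inactive/enabled",
	} {
		fmt.Printf("%-30q ungrouped=%-5v grouped=%v\n",
			line, ungrouped.MatchString(line), grouped.MatchString(line))
	}
}
```

The third line is the interesting case: the ungrouped pattern reports a match because of the bare `enabled` alternative, while the grouped pattern rejects it.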
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3e9f0f54-36b6-4168-b0da-008049d200c4
📒 Files selected for processing (5)
- cmd/openshift-install/main.go
- cmd/openshift-install/tnf.go
- pkg/fencing/validate.go
- pkg/fencing/validate_test.go
- pkg/gather/ssh/ssh.go
Fix daemonActiveRe regex operator precedence — group alternatives so it matches "active" followed by either "running" or "enabled". Fix gofmt import ordering in validate.go and tnf.go. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai resolve
@coderabbitai perform full review
✅ Actions performed: Full review triggered.
✅ Actions performed: Comments resolved. Approval is disabled.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/fencing/validate.go`:
- Around line 312-316: The fenceNode function constructs a shell command by
interpolating pcmkName directly into fmt.Sprintf which can lead to command
injection; update fenceNode to escape/quote pcmkName using the existing
shellQuote helper before building the command (i.e., call shellQuote(pcmkName)
and use that result when calling sshRun) so the pcs stonith fence argument is
safely quoted; ensure sshRun is still used and return its error as before.
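A minimal sketch of the quoting fix, assuming POSIX single-quote escaping. `quoteForShell` is a hypothetical stand-in for the PR's `shellQuote` helper, whose implementation is not shown here, and the `sudo` prefix in the command string is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteForShell is a hypothetical stand-in for the PR's shellQuote helper.
// It wraps s in single quotes and rewrites each embedded single quote as
// '\'' (close quote, escaped quote, reopen quote), which a POSIX shell
// reads back as the original string.
func quoteForShell(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// fenceCommand builds the remote command line. The "sudo" prefix is an
// assumption; only "pcs stonith fence" appears in the PR conversation.
func fenceCommand(pcmkName string) string {
	return fmt.Sprintf("sudo pcs stonith fence %s", quoteForShell(pcmkName))
}

func main() {
	fmt.Println(fenceCommand("master-0"))
	// A hostile or malformed name stays a single argument.
	fmt.Println(fenceCommand("master-0; reboot"))
}
```

Without the quoting, the second call would hand the remote shell two commands instead of one argument.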
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6302165f-b110-44a6-b494-a0fedbde6a93
📒 Files selected for processing (5)
- cmd/openshift-install/main.go
- cmd/openshift-install/tnf.go
- pkg/fencing/validate.go
- pkg/fencing/validate_test.go
- pkg/gather/ssh/ssh.go
- Quote pcmkName in fenceNode via shellQuote to prevent injection
- Use descriptive test names in TestFormatEtcdURL instead of IP literals
- Fix gofmt: one field per line in test struct literals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai resolve
✅ Actions performed: Comments resolved. Approval is disabled.
Group imports per .golangci.yaml gci config:
1. standard (stdlib)
2. default (third-party)
3. prefix(github.com/openshift)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RunOutput now treats agent forwarding failure as non-fatal (debug log instead of error return). Agent forwarding is not needed for running commands on the target node and fails on subsequent sessions to the same SSH client.

Remove 2>/dev/null and 2>&1 redirects from SSH commands. RunOutput captures stderr separately, so bash-level redirects are unnecessary.

Tested on a real DualReplica cluster: pre-flight checks pass, both nodes fenced and recovered successfully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (1)
pkg/gather/ssh/ssh.go (1)
86-87: ⚡ Quick win: Wrap `sess.Run` errors with command context for better diagnostics.

This currently returns a bare SSH error, which makes troubleshooting harder in calling flows. Wrapping it keeps behavior unchanged but improves debuggability (consistent with this file's existing error style).

Proposed patch
 err = sess.Run(command)
-return stdout.String(), stderr.String(), err
+if err != nil {
+	return stdout.String(), stderr.String(), errors.Wrapf(err, "failed to run remote command %q", command)
+}
+return stdout.String(), stderr.String(), nil
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/gather/ssh/ssh.go`:
- Around line 86-87: The call to sess.Run(command) returns a raw SSH error
without context; update the Run flow in pkg/gather/ssh/ssh.go to wrap that error
with the command string (and any other useful context) before returning so
callers get diagnostics (e.g., replace returning err with a wrapped error via
fmt.Errorf or the project's error-wrap utility, preserving original error as %w,
while still returning stdout.String() and stderr.String()).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 942b91f6-9cec-43e6-a746-d6416c02f2b2
📒 Files selected for processing (2)
- pkg/fencing/validate.go
- pkg/gather/ssh/ssh.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/fencing/validate.go
Handle the error from sshRun in checkStonith instead of discarding it. If the command fails AND returns no output, include the error in the diagnostic message. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza: all tests passed!
Summary
Adds `openshift-install tnf validate-fencing`, a command to validate fencing on Two Node with Fencing (DualReplica) clusters by fencing both nodes sequentially and verifying full recovery.

Usage

What it does

Pre-flight checks (via SSH to a control plane node):

Disruptive validation (for each node):
- `pcs stonith fence`

Prerequisites
- `core` user
- `<dir>/auth/kubeconfig`

Files
- pkg/fencing/validate.go
- pkg/fencing/validate_test.go
- cmd/openshift-install/tnf.go (`tnf` parent + `validate-fencing`)
- cmd/openshift-install/main.go: `newTNFCmd()`
- pkg/gather/ssh/ssh.go: `RunOutput()` for capturing SSH command output

example usage:

OCPEDGE-2254