Skip to content

OCPEDGE-2254: add openshift-install tnf validate-fencing subcommand#10546

Open
Neilhamza wants to merge 6 commits into
openshift:mainfrom
Neilhamza:fencing-validate-openshift-install
Open

OCPEDGE-2254: add openshift-install tnf validate-fencing subcommand#10546
Neilhamza wants to merge 6 commits into
openshift:mainfrom
Neilhamza:fencing-validate-openshift-install

Conversation

@Neilhamza
Copy link
Copy Markdown

@Neilhamza Neilhamza commented May 11, 2026

Summary

Adds openshift-install tnf validate-fencing — a command to validate fencing on Two Node with Fencing (DualReplica) clusters by fencing both nodes sequentially and verifying full recovery.

Usage

# From the install directory
openshift-install tnf validate-fencing

# With explicit directory and SSH key
openshift-install tnf validate-fencing --dir /path/to/install-dir --key ~/.ssh/id_rsa

What it does

Pre-flight checks (via SSH to a control plane node):

  • STONITH devices configured and enabled
  • Both nodes online in Pacemaker
  • corosync, pacemaker, pcsd daemons active
  • etcd has 2 healthy voter members

Disruptive validation (for each node):

  • Fences the node from its peer via pcs stonith fence
  • Waits for the node to go NotReady, then recover to Ready
  • Verifies Pacemaker rejoins and etcd quorum is restored
  • Warns if fencing takes >60s (possible BMC graceful shutdown)

Prerequisites

  • Deployed DualReplica cluster (rejects non-DualReplica with clear error)
  • SSH access to both nodes as core user
  • Cluster-admin kubeconfig at <dir>/auth/kubeconfig

Files

File Change
pkg/fencing/validate.go New — validation logic, SSH checks, parsing, orchestration
pkg/fencing/validate_test.go New — unit tests for all parsing functions
cmd/openshift-install/tnf.go New — cobra command (tnf parent + validate-fencing)
cmd/openshift-install/main.go Register newTNFCmd()
pkg/gather/ssh/ssh.go Add RunOutput() for capturing SSH command output

example usage:
image

OCPEDGE-2254

Add a new subcommand that validates fencing on Two Node with Fencing
(DualReplica) clusters by fencing both nodes sequentially and verifying
recovery. The command runs pre-flight checks (STONITH, Pacemaker, etcd),
then fences each node from its peer and validates full recovery.

This is a disruptive, manual command intended to be run post-install,
post-upgrade, or in CI/CD pipelines.

OCPEDGE-2254

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 11, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented May 11, 2026

@Neilhamza: This pull request references OCPEDGE-2254 which is a valid jira issue.

Details

In response to this:

Add a new subcommand that validates fencing on Two Node with Fencing (DualReplica) clusters by fencing both nodes sequentially and verifying recovery. The command runs pre-flight checks (STONITH, Pacemaker, etcd), then fences each node from its peer and validates full recovery.

This is a disruptive, manual command intended to be run post-install

OCPEDGE-2254

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a top-level tnf CLI command with validate-fencing to run a DualReplica STONITH/fencing validation via SSH against two control-plane nodes, plus an SSH output helper and unit tests. The installer root is wired to register the new tnf command.

Changes

Cluster Fencing Validation

Layer / File(s) Summary
CLI Root & Registration
cmd/openshift-install/main.go
Registers newTNFCmd(ctx) with the installer root command.
CLI Command
cmd/openshift-install/tnf.go
Adds tnf command and validate-fencing subcommand; binds --key flag, starts duration timer, registers Dir cleanup hook, loads kubeconfig from RootOpts.Dir/auth/kubeconfig, creates Kubernetes and OpenShift config clients, and invokes fencing.Run with SSH user core.
Config & Entrypoint
pkg/fencing/validate.go
Introduces Config and NodeInfo types and the Run(ctx, cfg) orchestration entrypoint.
Discovery & Topology
pkg/fencing/validate.go
Discovers control-plane nodes and internal IPs; verifies topology equals exactly two control-plane nodes (DualReplicaTopologyMode).
Pre-flight / SSH Helpers
pkg/fencing/validate.go
Adds SSH connect/run helpers and pre-flight checks: STONITH enabled parsing, Pacemaker online list parsing, daemon status extraction, and etcd endpoint health/membership checks.
Fencing Orchestration
pkg/fencing/validate.go
Implements fenceAndRecover / fenceNode: trigger pcs stonith fence, wait for node NotReady→Ready, poll Pacemaker online list, and poll etcd quorum restoration with timeouts; runs both directions.
Parsing & Utilities
pkg/fencing/validate.go
Parsing helpers and formatters: parseStonithEnabled, parsePacemakerOnline, nodeInOnlineList, parseDaemonStatus, parseEtcdHealth, parseEtcdMembers, resolvePacemakerNames, formatEtcdURL, and polling helpers.
SSH Output Helper
pkg/gather/ssh/ssh.go
Adds RunOutput(client *ssh.Client, command string) (string, string, error) to capture remote stdout/stderr as strings for programmatic checks; requests agent forwarding but ignores forward failures.
Unit Tests
pkg/fencing/validate_test.go
Adds tests for STONITH parsing, Pacemaker online parsing, node matching, daemon status parsing, etcd health/members parsing, and etcd URL formatting.
sequenceDiagram
    actor User
    participant CLI as rgba(70,130,180,0.5) CLI
    participant KubeAPI as rgba(34,139,34,0.5) Kubernetes API
    participant ConfigAPI as rgba(255,165,0,0.5) OpenShift Config API
    participant NodeA as rgba(123,104,238,0.5) Node A
    participant NodeB as rgba(220,20,60,0.5) Node B

    User->>CLI: run tnf validate-fencing
    CLI->>CLI: load kubeconfig, init clients
    CLI->>KubeAPI: list control-plane nodes
    KubeAPI-->>CLI: NodeA, NodeB
    CLI->>NodeA: SSH connect / pre-flight checks
    CLI->>NodeB: SSH connect / pre-flight checks
    CLI->>NodeA: pcs stonith fence NodeB
    CLI->>KubeAPI: wait NodeB NotReady → Ready
    CLI->>NodeA: poll Pacemaker for NodeB online
    CLI->>KubeAPI: poll etcd quorum restoration
    CLI->>NodeB: pcs stonith fence NodeA
    CLI->>KubeAPI: wait NodeA NotReady → Ready
    CLI->>NodeB: poll Pacemaker for NodeA online
    CLI->>KubeAPI: poll etcd quorum restoration
    CLI-->>User: success / failure
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test case names in validate_test.go are stable and deterministic, with no dynamic content like timestamps, UUIDs, generated suffixes, or variables from string formatting.
Test Structure And Quality ✅ Passed The custom check applies to Ginkgo test code, but the PR only includes standard Go unit tests using testing.T with table-driven subtests. Ginkgo is not used in this PR or these utility tests.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The only test file added (pkg/fencing/validate_test.go) contains standard Go unit tests using testing.T, not Ginkgo framework. The check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR does not add Ginkgo e2e tests. It adds a CLI command, utility libraries, and Go unit tests. The custom check applies only to Ginkgo e2e tests, which are absent.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds CLI tooling and validation utilities. No deployment manifests, operator code, or controllers present. Check not applicable.
Ote Binary Stdout Contract ✅ Passed This PR modifies openshift-install CLI, not an OTE test binary. The custom check applies only to test extension binaries that communicate with openshift-tests via JSON stdout, not installer CLI tools.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR does not add Ginkgo e2e tests. New file pkg/fencing/validate_test.go contains standard Go unit tests only. Check is not applicable.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarizes the main change: adding a new tnf validate-fencing subcommand to openshift-install, which is the core purpose of this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from barbacbd and bfournie May 11, 2026 08:14
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jcpowermac for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/fencing/validate.go`:
- Around line 40-41: The regexp daemonActiveRe currently uses alternation
incorrectly and matches "enabled" anywhere; update the pattern used in the
regexp.MustCompile call for daemonActiveRe so the alternation applies to the
word after "active" (e.g., group the alternatives so it matches "active"
followed by either "running" or "enabled" in a case-insensitive way), then run
tests/linters to confirm behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3e9f0f54-36b6-4168-b0da-008049d200c4

📥 Commits

Reviewing files that changed from the base of the PR and between 51977d8 and 7852b6c.

📒 Files selected for processing (5)
  • cmd/openshift-install/main.go
  • cmd/openshift-install/tnf.go
  • pkg/fencing/validate.go
  • pkg/fencing/validate_test.go
  • pkg/gather/ssh/ssh.go

Comment thread pkg/fencing/validate.go Outdated
Fix daemonActiveRe regex operator precedence — group alternatives so
it matches "active" followed by either "running" or "enabled".
Fix gofmt import ordering in validate.go and tnf.go.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza
Copy link
Copy Markdown
Author

@coderabbitai resolve

@Neilhamza
Copy link
Copy Markdown
Author

@coderabbitai perform full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

@Neilhamza: Sure! I'll kick off a full review of the PR right away.

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

✅ Actions performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/fencing/validate.go`:
- Around line 312-316: The fenceNode function constructs a shell command by
interpolating pcmkName directly into fmt.Sprintf which can lead to command
injection; update fenceNode to escape/quote pcmkName using the existing
shellQuote helper before building the command (i.e., call shellQuote(pcmkName)
and use that result when calling sshRun) so the pcs stonith fence argument is
safely quoted; ensure sshRun is still used and return its error as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6302165f-b110-44a6-b494-a0fedbde6a93

📥 Commits

Reviewing files that changed from the base of the PR and between 51977d8 and afa8f05.

📒 Files selected for processing (5)
  • cmd/openshift-install/main.go
  • cmd/openshift-install/tnf.go
  • pkg/fencing/validate.go
  • pkg/fencing/validate_test.go
  • pkg/gather/ssh/ssh.go

Comment thread pkg/fencing/validate.go
- Quote pcmkName in fenceNode via shellQuote to prevent injection
- Use descriptive test names in TestFormatEtcdURL instead of IP literals
- Fix gofmt: one field per line in test struct literals

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza
Copy link
Copy Markdown
Author

@coderabbitai resolve

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 11, 2026

✅ Actions performed

Comments resolved. Approval is disabled; enable reviews.request_changes_workflow to allow explicit top-level @coderabbitai resolve or @coderabbitai approve commands.

Neilhamza and others added 2 commits May 11, 2026 12:46
Group imports per .golangci.yaml gci config:
1. standard (stdlib)
2. default (third-party)
3. prefix(github.com/openshift)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RunOutput now treats agent forwarding failure as non-fatal (debug log
instead of error return). Agent forwarding is not needed for running
commands on the target node and fails on subsequent sessions to the
same SSH client.

Remove 2>/dev/null and 2>&1 redirects from SSH commands — RunOutput
captures stderr separately, so bash-level redirects are unnecessary.

Tested on a real DualReplica cluster: pre-flight checks pass, both
nodes fenced and recovered successfully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/gather/ssh/ssh.go (1)

86-87: ⚡ Quick win

Wrap sess.Run errors with command context for better diagnostics.

This currently returns a bare SSH error, which makes troubleshooting harder in calling flows. Wrapping it keeps behavior unchanged but improves debuggability (consistent with this file’s existing error style).

Proposed patch
  err = sess.Run(command)
- return stdout.String(), stderr.String(), err
+ if err != nil {
+ 	return stdout.String(), stderr.String(), errors.Wrapf(err, "failed to run remote command %q", command)
+ }
+ return stdout.String(), stderr.String(), nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/gather/ssh/ssh.go` around lines 86 - 87, The call to sess.Run(command)
returns a raw SSH error without context; update the Run flow in
pkg/gather/ssh/ssh.go to wrap that error with the command string (and any other
useful context) before returning so callers get diagnostics (e.g., replace
returning err with a wrapped error via fmt.Errorf or the project's error-wrap
utility, preserving original error as %w, while still returning stdout.String()
and stderr.String()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/gather/ssh/ssh.go`:
- Around line 86-87: The call to sess.Run(command) returns a raw SSH error
without context; update the Run flow in pkg/gather/ssh/ssh.go to wrap that error
with the command string (and any other useful context) before returning so
callers get diagnostics (e.g., replace returning err with a wrapped error via
fmt.Errorf or the project's error-wrap utility, preserving original error as %w,
while still returning stdout.String() and stderr.String()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 942b91f6-9cec-43e6-a746-d6416c02f2b2

📥 Commits

Reviewing files that changed from the base of the PR and between f9d8467 and f019692.

📒 Files selected for processing (2)
  • pkg/fencing/validate.go
  • pkg/gather/ssh/ssh.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/fencing/validate.go

Handle the error from sshRun in checkStonith instead of discarding it.
If the command fails AND returns no output, include the error in the
diagnostic message.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza Neilhamza changed the title [WIP] OCPEDGE-2254: add openshift-install tnf validate-fencing subcommand OCPEDGE-2254: add openshift-install tnf validate-fencing subcommand May 11, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 11, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 11, 2026

@Neilhamza: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants