Skip to content

Bug 86133: validate duplicate Nutanix failureDomain names and topology#10564

Open
chdeshpa-hue wants to merge 2 commits into
openshift:mainfrom
chdeshpa-hue:fix/OCPBUGS-86073-nutanix-duplicate-failuredomain-validation
Open

Bug 86133: validate duplicate Nutanix failureDomain names and topology#10564
chdeshpa-hue wants to merge 2 commits into
openshift:mainfrom
chdeshpa-hue:fix/OCPBUGS-86073-nutanix-duplicate-failuredomain-validation

Conversation

@chdeshpa-hue
Copy link
Copy Markdown

@chdeshpa-hue chdeshpa-hue commented May 19, 2026

Summary

Adds two missing validation checks for Nutanix failure domains:

  • Duplicate name detection: Rejects configurations where two failure domains share the same name. This was previously unvalidated for Nutanix (vSphere already had this check).
  • Duplicate topology detection: Rejects failure domains that have identical Prism Element UUID and subnet UUIDs. Copy-pasted failure domains that differ only in name provide no additional fault tolerance and can cause subtle scheduling issues.

Split from #10561 per reviewer feedback — this PR contains the Nutanix portion only.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86133

Manual Test Results

Tested with a custom-built openshift-install binary. When two Nutanix failure domains share the same name or identical topology, the installer now correctly rejects:

Duplicate name:

ERROR failed to load asset "Install Config": invalid "install-config.yaml" file:
  platform.nutanix.failureDomains[1].name: Duplicate value:
    "failure domain name \"fd-pe1\" is already used by failureDomains[0]"

Duplicate topology:

ERROR failed to load asset "Install Config": invalid "install-config.yaml" file:
  platform.nutanix.failureDomains[1]: Invalid value: "fd-pe1-copy":
    failure domain "fd-pe1-copy" has identical topology (same prismElement and subnets)
    as "fd-pe1"; this provides no additional fault tolerance

Test Plan

  • Unit tests pass: go test ./pkg/types/nutanix/validation/
  • New test cases: failureDomain with duplicate name, failureDomain with duplicate topology same prismElement and subnet, valid failureDomain with different prismElements
  • Manual testing with custom binary

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes

    • Improved Nutanix platform failure-domain validation to catch duplicate failure-domain names and detect identical topology configurations (Prism Element plus subnet set), including order-independent subnet comparisons.
  • Tests

    • Added table-driven tests covering duplicate names, identical topologies, and order-independence of subnet lists.

Adds two checks to Nutanix failure domain validation:

1. Duplicate name detection: rejects configurations where two failure
   domains share the same name, which previously went unvalidated.

2. Duplicate topology detection: rejects failure domains that have
   identical Prism Element UUID and subnet UUIDs. Copy-pasted failure
   domains that differ only in name provide no additional fault
   tolerance and can cause subtle scheduling issues.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86073
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Walkthrough

The PR enhances ValidatePlatform to detect duplicate Nutanix failure domains by name and by topology (Prism Element UUID plus the set of subnet UUIDs). It adds deterministic topology-key construction using sorted subnet UUIDs and updates tests to cover duplicate-name and duplicate-topology cases.

Changes

Failure-domain duplication validation

Layer / File(s) Summary
Duplicate failure-domain validation logic
pkg/types/nutanix/validation/platform.go
Import sort and strings. Extend failure-domain validation to iterate by index, track and report duplicate failure-domain names via field.Duplicate, and detect duplicate topologies using a new nutanixFailureDomainTopologyKey helper that builds a deterministic key from Prism Element UUID and sorted subnet UUIDs.
Failure-domain duplication test cases
pkg/types/nutanix/validation/platform_test.go
Add table-driven cases to TestValidatePlatform: duplicate failure-domain names produce an error; identical topology (same prismElement and same SubnetUUIDs, order-independent) across different entries produces an error; differing prismElement with shared SubnetUUIDs passes validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: adding validation checks for duplicate Nutanix failure domain names and topology.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the PR are static string literals with no dynamic content (UUIDs, timestamps, IPs, generated suffixes). Test names are descriptive and stable across runs.
Test Structure And Quality ✅ Passed The custom check specifies Ginkgo test code review, but the PR contains standard Go unit tests using table-driven patterns with testify assertions, not Ginkgo tests.
Microshift Test Compatibility ✅ Passed PR adds standard Go unit tests, not Ginkgo e2e tests. The custom check only applies to Ginkgo e2e tests, so it is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR contains only unit tests using standard Go testing package, not Ginkgo e2e tests. Check applies only to e2e tests; SNO compatibility verification is not applicable here.
Topology-Aware Scheduling Compatibility ✅ Passed Validates Nutanix installer configuration only; does not introduce pod scheduling constraints, affinity rules, or topology-aware scheduling logic.
Ote Binary Stdout Contract ✅ Passed Both changed files are validation library functions and unit tests with no process-level code, no stdout writes, and no logging configured to emit to stdout at the process level.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added. The PR contains only standard Go unit tests (testing.T with table-driven cases) for Nutanix validation logic. No IPv4 assumptions or external connectivity detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 19, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

Hi @chdeshpa-hue. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from jcpowermac and rvanderp3 May 19, 2026 12:02
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jcpowermac for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/types/nutanix/validation/platform_test.go (1)

433-451: ⚡ Quick win

Add a regression case for subnet-order invariance.

Since topology comparison depends on sorted subnet UUIDs, add a case where two failure domains have identical subnet sets in different orders and still fail as duplicate topology. This protects the core normalization behavior from regressions.

Example test case to add
+       {
+           name: "failureDomain duplicate topology with same subnets in different order",
+           platform: func() *nutanix.Platform {
+               p := validPlatform()
+               p.FailureDomains = []nutanix.FailureDomain{
+                   {
+                       Name:         "fd-1",
+                       PrismElement: nutanix.PrismElement{UUID: "fd-pe-uuid", Endpoint: nutanix.PrismEndpoint{Address: "fd-pe", Port: 9440}},
+                       SubnetUUIDs:  []string{"subnet-a", "subnet-b"},
+                   },
+                   {
+                       Name:         "fd-2",
+                       PrismElement: nutanix.PrismElement{UUID: "fd-pe-uuid", Endpoint: nutanix.PrismEndpoint{Address: "fd-pe", Port: 9440}},
+                       SubnetUUIDs:  []string{"subnet-b", "subnet-a"},
+                   },
+               }
+               return p
+           }(),
+           expectedError: `test-path\.failureDomains\[1\]: Invalid value: "fd-2": failure domain "fd-2" has identical topology`,
+       },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/nutanix/validation/platform_test.go` around lines 433 - 451, Add a
regression test variant to the existing failure-domain duplicate-topology case
that verifies subnet-order invariance: modify the test case that uses
nutanix.Platform / FailureDomain (the case with Name "fd-1" and "fd-2") so one
FailureDomain has SubnetUUIDs in a different order than the other (e.g.,
{"a","b"} vs {"b","a"}) while keeping the same PrismElement (PrismElement and
PrismEndpoint fields identical), and assert the same expected error string about
identical topology; this ensures the topology comparison logic
(normalization/sorting of SubnetUUIDs) in the validation still treats
differently ordered but identical subnet sets as duplicates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/types/nutanix/validation/platform.go`:
- Around line 122-128: The duplicate-topology check using
topoKey/nutanixFailureDomainTopologyKey should be skipped for malformed failure
domains: ensure you only compute topoKey and consult/update fdTopologies after
basic validation that fd.PrismElement UUID is non-empty/valid and fd.Subnets is
non-empty and each subnet is valid; alternatively move the entire
topoKey/fdTopologies block to run after the existing required-field/subnet
validations so that field.Invalid additions for empty/invalid PrismElement or
Subnets occur instead of spurious "identical topology" errors.

---

Nitpick comments:
In `@pkg/types/nutanix/validation/platform_test.go`:
- Around line 433-451: Add a regression test variant to the existing
failure-domain duplicate-topology case that verifies subnet-order invariance:
modify the test case that uses nutanix.Platform / FailureDomain (the case with
Name "fd-1" and "fd-2") so one FailureDomain has SubnetUUIDs in a different
order than the other (e.g., {"a","b"} vs {"b","a"}) while keeping the same
PrismElement (PrismElement and PrismEndpoint fields identical), and assert the
same expected error string about identical topology; this ensures the topology
comparison logic (normalization/sorting of SubnetUUIDs) in the validation still
treats differently ordered but identical subnet sets as duplicates.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6c9bd4ef-5036-40cd-a0f4-1223dd900de4

📥 Commits

Reviewing files that changed from the base of the PR and between 0d798da and e082536.

📒 Files selected for processing (2)
  • pkg/types/nutanix/validation/platform.go
  • pkg/types/nutanix/validation/platform_test.go

Comment thread pkg/types/nutanix/validation/platform.go
@jcpowermac
Copy link
Copy Markdown
Contributor

/ok-to-test

cc: @abhay-nutanix

@openshift-ci openshift-ci Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 19, 2026
@chdeshpa-hue chdeshpa-hue changed the title Bug 86073: validate duplicate Nutanix failureDomain names and topology Bug 86133: validate duplicate Nutanix failureDomain names and topology May 19, 2026
- Move duplicate topology detection after PrismElement UUID and subnet
  validation so users see actionable "required field" errors instead of
  misleading "identical topology" when fields are simply empty
- Use \x00 as subnet separator to eliminate join ambiguity
- Add regression test confirming subnet-order invariance: identical
  subnet sets in different order are correctly detected as duplicates

Bug: https://redhat.atlassian.net/browse/OCPBUGS-86133
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
pkg/types/nutanix/validation/platform.go (1)

134-142: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Skip topology dedupe when subnet list is syntactically present but effectively empty.

Line 134 still allows duplicate-topology checks for malformed SubnetUUIDs like []string{""}, which can reintroduce misleading “identical topology” errors alongside required-subnet errors. Tighten the gate to match validateSubnets’ required check semantics.

Suggested fix
-			if fd.PrismElement.UUID != "" && len(fd.SubnetUUIDs) > 0 {
+			if fd.PrismElement.UUID != "" && len(fd.SubnetUUIDs) > 0 && fd.SubnetUUIDs[0] != "" {
 				topoKey := nutanixFailureDomainTopologyKey(fd)
 				if prevName, exists := fdTopologies[topoKey]; exists {
 					allErrs = append(allErrs, field.Invalid(fldPath.Child("failureDomains").Index(idx), fd.Name,
 						fmt.Sprintf("failure domain %q has identical topology (same prismElement and subnets) as %q; this provides no additional fault tolerance", fd.Name, prevName)))
 				} else {
 					fdTopologies[topoKey] = fd.Name
 				}
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/types/nutanix/validation/platform.go` around lines 134 - 142, The
topology dedupe currently runs when fd.SubnetUUIDs is non-nil or non-zero-length
(e.g., []string{""}), which can falsely detect identical topologies; change the
guard in the block that references nutanixFailureDomainTopologyKey(fd) so it
only proceeds when SubnetUUIDs contains at least one non-empty UUID (for
example: replace len(fd.SubnetUUIDs) > 0 with a helper-style check that scans
for any s != "" or reuse the same required-subnet semantics from
validateSubnets), then keep the rest of the logic (fdTopologies lookup and
assignment) unchanged so malformed empty-string subnet slices are skipped and
won't trigger the duplicate-topology error.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/types/nutanix/validation/platform.go`:
- Around line 134-142: The topology dedupe currently runs when fd.SubnetUUIDs is
non-nil or non-zero-length (e.g., []string{""}), which can falsely detect
identical topologies; change the guard in the block that references
nutanixFailureDomainTopologyKey(fd) so it only proceeds when SubnetUUIDs
contains at least one non-empty UUID (for example: replace len(fd.SubnetUUIDs) >
0 with a helper-style check that scans for any s != "" or reuse the same
required-subnet semantics from validateSubnets), then keep the rest of the logic
(fdTopologies lookup and assignment) unchanged so malformed empty-string subnet
slices are skipped and won't trigger the duplicate-topology error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d5fcaf96-c9bf-42f3-b7f1-70141eaf16da

📥 Commits

Reviewing files that changed from the base of the PR and between e082536 and e540fe7.

📒 Files selected for processing (2)
  • pkg/types/nutanix/validation/platform.go
  • pkg/types/nutanix/validation/platform_test.go

@chdeshpa-hue
Copy link
Copy Markdown
Author

/retest

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 24, 2026

@chdeshpa-hue: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/gofmt e540fe7 link true /test gofmt
ci/prow/e2e-nutanix-ovn e540fe7 link false /test e2e-nutanix-ovn
ci/prow/golint e540fe7 link true /test golint

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants