WIP AGENT-1428: Add e2e extended tests for NoRegistryClusterInstall #30824
pawanpinjarkar wants to merge 1 commit into openshift:main
Conversation
Pipeline controller notification
For optional jobs, comment …
This repository is configured in: automatic mode
@pawanpinjarkar: This pull request references AGENT-1428, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)

Walkthrough: Adds a new Ginkgo-based extended test suite for InternalReleaseImage (IRI) and registers it via a blank import; tests exercise IRI presence, status/conditions, internal registry accessibility, reconciliation after disruption, admission-policy deletion protection, HA across masters, workload image pulls, and MachineSet scaling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Tester as Test Runner
    participant API as Kubernetes API
    participant Registry as Internal Registry
    participant Masters as Master Nodes
    participant Workers as Worker Nodes
    Tester->>API: verify feature gate & list IRI resources
    API-->>Tester: return IRI object & status
    Tester->>Registry: HTTP health / manifest requests
    Registry-->>Tester: HTTP 200 / manifest data
    Tester->>Masters: probe registry reachability on each master
    Masters-->>Tester: HTTP/ping responses
    Tester->>API: simulate disruption (delete/taint resources)
    API-->>Tester: acknowledge deletion/taint
    Tester->>API: wait and verify reconciliation/restoration
    API-->>Tester: restored MachineConfigs/Pods/status
    Tester->>API: scale MachineSet (create worker)
    API-->>Workers: provision new node
    Workers-->>Tester: join cluster and pull images from Registry
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)

⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
Skipping CI for Draft Pull Request.
Actionable comments posted: 2
🧹 Nitpick comments (3)
test/extended/internalreleaseimage/internalreleaseimage.go (3)
149-175: Test title doesn't match actual validation. The test is titled "be accessible on all master nodes" but only verifies that master nodes have internal IPs; it does not actually test registry accessibility. Consider either:
- renaming it to "have valid internal IPs on all master nodes", or
- adding actual registry accessibility checks using `checkIRIRegistryAccessible`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 149-175: the test labeled g.It("be accessible on all master nodes [apigroup:machineconfiguration.openshift.io]") only verifies master node internal IPs (it iterates masterNodes.Items and checks nodeIP) but doesn't actually test registry accessibility; either rename the test string to reflect IP validation (e.g., "have valid internal IPs on all master nodes") or extend the loop to call the existing checkIRIRegistryAccessible function for each master nodeIP (use nodeIP and node.Name from the masterNodes iteration) and assert its success so the test truly validates registry accessibility.
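For reference, a minimal sketch of the second option, reusing the `masterNodes` list, `corev1` import, and `checkIRIRegistryAccessible` helper that this comment says already exist in the file (the surrounding Ginkgo context and imports are assumed):

```go
g.By("Verifying IRI registry is accessible on all master nodes")
for _, node := range masterNodes.Items {
	// Resolve the node's internal IP, as the current test already does.
	nodeIP := ""
	for _, addr := range node.Status.Addresses {
		if addr.Type == corev1.NodeInternalIP {
			nodeIP = addr.Address
			break
		}
	}
	o.Expect(nodeIP).ShouldNot(o.BeEmpty(), "Master node %s should have an internal IP", node.Name)

	// Actively probe the registry instead of only validating the IP.
	o.Expect(checkIRIRegistryAccessible(nodeIP)).To(o.BeTrue(),
		"IRI registry should be accessible on master node %s (%s)", node.Name, nodeIP)
}
```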
700-709: Registry `/v2/` endpoint may return 401 Unauthorized without credentials. Per the Docker Registry HTTP API v2 spec, the `/v2/` endpoint can return `401 Unauthorized` when authentication is required, which still indicates the registry is accessible. Restricting to only `200 OK` may cause false negatives.

Suggested fix

```diff
-if resp.StatusCode != http.StatusOK {
+// Registry v2 API returns 200 OK or 401 Unauthorized (if auth required).
+// Both indicate the registry is accessible.
+if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusUnauthorized {
 	e2e.Logf("Registry responded with unexpected status %d at %s", resp.StatusCode, registryURL)
 	return false
 }
-
-apiVersion := resp.Header.Get("Docker-Distribution-Api-Version")
-if apiVersion != "registry/2.0" {
-	e2e.Logf("Registry API version mismatch at %s: %s", registryURL, apiVersion)
-	return false
-}
+
+// Only check the API version header on a successful response.
+if resp.StatusCode == http.StatusOK {
+	apiVersion := resp.Header.Get("Docker-Distribution-Api-Version")
+	if apiVersion != "registry/2.0" {
+		e2e.Logf("Registry API version mismatch at %s: %s", registryURL, apiVersion)
+		return false
+	}
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 700-709: the current check treats any non-200 response as failure; change the logic so that resp.StatusCode values 200 OK and 401 Unauthorized are considered successful accessibility responses (treat other statuses as failure and log them using registryURL and resp.StatusCode), and only validate the Docker-Distribution-Api-Version header (apiVersion) when resp.StatusCode == http.StatusOK; if resp.StatusCode == http.StatusUnauthorized, log that auth is required and return true without checking apiVersion.
371-372: Replace `time.Sleep` with polling to avoid flakiness. A fixed sleep of 10 seconds may be insufficient in some environments and wasteful in others. Consider polling until the registry is confirmed unavailable on the target node.

Suggested approach

```diff
 g.By("Waiting for registry to be unavailable on target node")
-time.Sleep(10 * time.Second)
+err = wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 30*time.Second, true,
+	func(_ context.Context) (bool, error) {
+		accessible := checkIRIRegistryAccessible(targetIP)
+		return !accessible, nil
+	})
+o.Expect(err).NotTo(o.HaveOccurred(), "Registry should become unavailable on target node")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 371-372: the fixed time.Sleep(10 * time.Second) after g.By("Waiting for registry to be unavailable on target node") should be replaced with a polling loop (e.g., wait.PollImmediate or a simple for/select with time.Ticker and timeout) that repeatedly checks registry reachability and returns as soon as it's unavailable; replace the time.Sleep call with a call to a helper like checkRegistryUnavailableOnNode(node) (or inline an HTTP/TCP probe) and poll until success or timeout, logging progress with g.By and failing the test if the timeout elapses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 680-684: The TLS client setup in the http.Client transport sets
InsecureSkipVerify but does not set a minimum TLS version; update the TLS config
used in the Transport (the tls.Config instance inside the http.Client
initialization in this file) to include MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13 if you prefer stricter) so the connection enforces a minimum
TLS protocol version while still keeping InsecureSkipVerify for the test
registry.
- Around line 533-540: The scaling call incorrectly uses initialWorkerCount
(total cluster workers) instead of the selected MachineSet's replica count; fix
by reading the MachineSet's current spec.replicas (e.g., via
oc.AsAdmin().Run("get").Args("machineset", machineSetName, "-n",
"openshift-machine-api", "-o", "jsonpath={.spec.replicas}") or by fetching the
MachineSet object in code) and compute targetReplicas := currentReplicas + 1,
then pass fmt.Sprintf("--replicas=%d", targetReplicas) to
oc.AsAdmin().Run("scale").Args(...). Ensure you reference machineSetName and
replace initialWorkerCount in the scale invocation.
ℹ️ Review info
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting

📒 Files selected for processing (2)
- test/extended/include.go
- test/extended/internalreleaseimage/internalreleaseimage.go
[APPROVAL NOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pawanpinjarkar
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
pawanpinjarkar force-pushed from 1e880cb to d70e8e0 (Compare)
Actionable comments posted: 5
♻️ Duplicate comments (2)
test/extended/internalreleaseimage/internalreleaseimage.go (2)
691-694: ⚠️ Potential issue | 🟡 Minor: Set TLS minimum version explicitly in client config. Line 693 sets `InsecureSkipVerify` but omits `MinVersion`; set a minimum protocol version explicitly.

Proposed fix

```diff
 client := &http.Client{
 	Transport: &http.Transport{
-		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
+		TLSClientConfig: &tls.Config{
+			InsecureSkipVerify: true,
+			MinVersion:         tls.VersionTLS12,
+		},
 	},
 	Timeout: 10 * time.Second,
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 691-694: the TLS config used for the http.Client sets InsecureSkipVerify but omits a minimum TLS version; update the tls.Config in the http.Transport (the client variable) to include MinVersion: tls.VersionTLS12 so the config explicitly enforces TLS 1.2 as the minimum while keeping InsecureSkipVerify as currently set.
544-551: ⚠️ Potential issue | 🟠 Major: Scale target uses total worker count instead of selected MachineSet replicas. Line 548 should scale from the MachineSet's current replica count, not the cluster-wide worker count; otherwise it can overscale or underscale when multiple MachineSets exist.

Proposed fix

```diff
+currentReplicaCount, err := strconv.Atoi(strings.TrimSpace(currentReplicas))
+o.Expect(err).NotTo(o.HaveOccurred(), "Failed to parse MachineSet replicas: %q", currentReplicas)
+
 g.By("Scaling MachineSet up by 1")
 _, err = oc.AsAdmin().Run("scale").Args(
 	"machineset", machineSetName,
 	"-n", "openshift-machine-api",
-	fmt.Sprintf("--replicas=%d", initialWorkerCount+1),
+	fmt.Sprintf("--replicas=%d", currentReplicaCount+1),
 ).Output()
 o.Expect(err).NotTo(o.HaveOccurred(), "Failed to scale MachineSet")
-e2e.Logf("Scaled MachineSet %s to %d replicas", machineSetName, initialWorkerCount+1)
+e2e.Logf("Scaled MachineSet %s to %d replicas", machineSetName, currentReplicaCount+1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 544-551: the scale call is using initialWorkerCount (cluster-wide workers) instead of the MachineSet's current replica count, which can mis-scale when multiple MachineSets exist; first query the MachineSet's current replicas (e.g., via oc.AsAdmin().Run("get").Args("machineset", machineSetName, "-n", "openshift-machine-api", "-o", "jsonpath={.spec.replicas}") or equivalent), parse that value to an int, compute desiredReplicas = currentReplicas + 1, and then pass fmt.Sprintf("--replicas=%d", desiredReplicas) to the oc.AsAdmin().Run("scale").Args(...) call (replace usage of initialWorkerCount), ensuring error handling for the get/parse steps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/internalreleaseimage/internalreleaseimage.go`:
- Around line 611-620: The test currently only checks newNode.Status.Conditions
for NodeReady but does not perform the promised registry connectivity check; add
an active probe that runs from the new worker node to the registry endpoint on
port 22625 (e.g. create a short-lived Pod scheduled onto newNode.Name or exec
into a probe Pod on that node) that attempts a TCP connect or HTTP request to
the registry host:22625, waits for a successful response/exit code, and assert
that the probe succeeded before or in addition to the existing
o.Expect(nodeReady) check; reference newNode.Name for scheduling and the
registry endpoint/port (22625) when implementing the probe and teardown.
- Around line 678-683: The current filter for machineconfigs uses only
OwnerReferences[0], which is brittle; update the loop that iterates mcList.Items
(where mc.OwnerReferences and iriMCs and IRIResourceName are referenced) to scan
all entries in mc.OwnerReferences and append mc.Name to iriMCs if any ownerRef
has Kind == "InternalReleaseImage" and Name == IRIResourceName, breaking after
append to avoid duplicate adds; remove reliance on indexing and the
len(mc.OwnerReferences) check since the for-range handles empty slices.
- Around line 372-380: The test currently sleeps 10s then only checks other
masters, missing an assertion that the target master actually went down; replace
the fixed time.Sleep with a polling/timeout loop that repeatedly calls
checkIRIRegistryAccessible(masterIPs[0] or the target IP variable) until it
returns false (target unreachable) or a timeout is reached, and fail the test if
the target remains reachable after the timeout; once the target unreachability
is asserted, proceed to verify peers using the existing loop that calls
checkIRIRegistryAccessible(masterIPs[i]) and expects true so HA is validated.
- Around line 508-533: The current check uses the raw JSON string in variable
machinesets to decide whether MachineSets exist, which is unreliable; instead
parse or directly query for an item and skip based on a parsed/queried result:
replace the len(machinesets) == 0 check by either decoding the JSON output into
a struct/slice and verifying len(items) > 0 or by running a jsonpath query (like
the existing jsonpath used to get machineSetName) and skipping when
machineSetName is empty; update the logic around machinesets and machineSetName
(the oc.AsAdmin().Run("get") call that fetches machinesets and the subsequent
jsonpath call that assigns machineSetName) so the existence check is performed
on the parsed result (or machineSetName) before proceeding.
- Around line 150-176: The test titled "be accessible on all master nodes..."
only verifies discovery of NodeInternalIP (masterNodes, nodeIP) but does not
actually probe registry reachability; modify the loop in the test to call an
active probe function (e.g., checkIRIRegistryAccessible(nodeIP)) for each master
after resolving nodeIP and assert its success, ensuring any errors are surfaced
via o.Expect or t/Fatal equivalents; if checkIRIRegistryAccessible does not
exist, implement it to perform a network probe (HTTP/TCP as appropriate) and
return an error/boolean so the test can fail when a master cannot reach the IRI
registry.
```go
g.By(fmt.Sprintf("Verifying new worker node %s can access IRI registry", newNode.Name))
nodeReady := false
for _, condition := range newNode.Status.Conditions {
	if condition.Type == corev1.NodeReady && condition.Status == corev1.ConditionTrue {
		nodeReady = true
		break
	}
}
o.Expect(nodeReady).To(o.BeTrue(), "New worker node should be Ready, indicating successful registry access")
```
Connectivity claim is not validated; only NodeReady is checked.
Line 611 says registry connectivity is being verified, but Lines 612-620 only assert NodeReady. Add an active connectivity probe from the new worker node to the registry endpoint on port 22625.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 611
- 620, The test currently only checks newNode.Status.Conditions for NodeReady
but does not perform the promised registry connectivity check; add an active
probe that runs from the new worker node to the registry endpoint on port 22625
(e.g. create a short-lived Pod scheduled onto newNode.Name or exec into a probe
Pod on that node) that attempts a TCP connect or HTTP request to the registry
host:22625, waits for a successful response/exit code, and assert that the probe
succeeded before or in addition to the existing o.Expect(nodeReady) check;
reference newNode.Name for scheduling and the registry endpoint/port (22625)
when implementing the probe and teardown.
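For illustration, such a probe could be a short-lived pod pinned to the new worker. This is a sketch only: `registryHost` stands for however the test resolves the registry endpoint, `image.ShellImage()` is assumed to be an available shell-capable test image helper, and a usable namespace on `oc` is assumed:

```go
g.By(fmt.Sprintf("Probing IRI registry on port 22625 from node %s", newNode.Name))
probePod := &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{GenerateName: "iri-registry-probe-", Namespace: oc.Namespace()},
	Spec: corev1.PodSpec{
		NodeName:      newNode.Name, // pin the probe to the newly provisioned worker
		RestartPolicy: corev1.RestartPolicyNever,
		Containers: []corev1.Container{{
			Name:  "probe",
			Image: image.ShellImage(), // assumed helper for a shell-capable image
			// Bash /dev/tcp gives a dependency-free TCP connect check.
			Command: []string{"bash", "-c",
				fmt.Sprintf("timeout 10 bash -c 'cat < /dev/null > /dev/tcp/%s/22625'", registryHost)},
		}},
	},
}
created, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(
	context.Background(), probePod, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer func() {
	_ = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(
		context.Background(), created.Name, metav1.DeleteOptions{})
}()

// Wait for the probe container to exit successfully.
err = wait.PollUntilContextTimeout(context.Background(), 5*time.Second, 2*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		pod, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, created.Name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return pod.Status.Phase == corev1.PodSucceeded, nil
	})
o.Expect(err).NotTo(o.HaveOccurred(),
	"TCP probe to %s:22625 from node %s should succeed", registryHost, newNode.Name)
```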
pawanpinjarkar force-pushed from d70e8e0 to 2d74b43 (Compare)
♻️ Duplicate comments (4)
test/extended/internalreleaseimage/internalreleaseimage.go (4)
313-320: ⚠️ Potential issue | 🟠 Major: Assert target-node registry outage before validating HA peers. Line 314 uses a fixed sleep, and the test never verifies the disrupted node actually became unreachable; this can pass even when the disruption failed.

Proposed fix

```diff
 g.By("Waiting for registry to be unavailable on target node")
-time.Sleep(10 * time.Second)
+err = wait.PollUntilContextTimeout(context.Background(), 2*time.Second, 30*time.Second, true,
+	func(context.Context) (bool, error) {
+		return !checkIRIRegistryAccessible(targetIP), nil
+	})
+o.Expect(err).NotTo(o.HaveOccurred(),
+	"Target node registry should become unavailable after service stop")

 g.By("Verifying registry is still accessible on other master nodes")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 313-320: the test currently uses a fixed time.Sleep and never asserts the disrupted target node actually went down; replace the sleep with a polling loop that repeatedly calls checkIRIRegistryAccessible on the target IP until it returns false (with a configurable timeout and interval) and fail the test if the node remains reachable after the timeout; once the loop confirms the target is unreachable, proceed to the existing loop that verifies the other masters via checkIRIRegistryAccessible(masterIPs[i]) so HA is validated.
556-565: ⚠️ Potential issue | 🟠 Major: The "registry connectivity" claim is not validated. Lines 557-565 only re-check `NodeReady`; they don't probe port `22625` from the new worker node. This can report success without proving registry reachability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 556-565: the current check only inspects newNode.Status.Conditions (the NodeReady loop around newNode) but does not verify registry connectivity on port 22625; replace or augment that validation by executing a network probe from the new node (for example schedule or exec a short pod onto the node, or create a Job/DaemonSet targeted to newNode) that attempts a TCP connection to the registry host:22625 and assert success; reference the existing newNode object and the NodeReady check to gate running the probe (i.e., once newNode is Ready, run the probe) and fail the test if the TCP dial from the node to port 22625 fails.
449-457: ⚠️ Potential issue | 🟠 Major: MachineSet support check is ineffective on raw JSON length. Line 455 checks `len(machinesets) == 0`, but the JSON output is non-empty even when `.items` is empty, so the test can fail late with noisier errors.

Proposed fix

```diff
-g.By("Checking if cluster supports worker scaling")
-machinesets, err := oc.AsAdmin().Run("get").Args(
-	"machinesets",
-	"-n", "openshift-machine-api",
-	"-o", "json",
-).Output()
-if err != nil || len(machinesets) == 0 {
-	g.Skip("Cluster does not support MachineSet scaling (may be bare metal or single node)")
-}
+g.By("Selecting a MachineSet to scale")
+machineSetName, err := oc.AsAdmin().Run("get").Args(
+	"machinesets",
+	"-n", "openshift-machine-api",
+	"-o", "jsonpath={.items[0].metadata.name}",
+).Output()
+if err != nil || strings.TrimSpace(machineSetName) == "" {
+	g.Skip("Cluster does not support MachineSet scaling (may be bare metal or single node)")
+}
+machineSetName = strings.TrimSpace(machineSetName)
+e2e.Logf("Selected MachineSet: %s", machineSetName)

 g.By("Getting current worker count")
@@
-g.By("Selecting a MachineSet to scale")
-machineSetName, err := oc.AsAdmin().Run("get").Args(
-	"machinesets",
-	"-n", "openshift-machine-api",
-	"-o", "jsonpath={.items[0].metadata.name}",
-).Output()
-o.Expect(err).NotTo(o.HaveOccurred(), "Failed to get MachineSet name")
-o.Expect(machineSetName).ShouldNot(o.BeEmpty(), "MachineSet name should not be empty")
-e2e.Logf("Selected MachineSet: %s", machineSetName)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 449-457: the check using len(machinesets) is unreliable because the JSON string is non-empty even when .items is empty; after calling oc.AsAdmin().Run("get").Args(...).Output() (the machinesets variable), parse/unmarshal that JSON and verify the .items array length, or alternatively query with -o jsonpath={.items} and ensure the returned items are non-empty before deciding to Skip; update the condition that currently references machinesets to inspect the parsed .items result instead.
158-161: ⚠️ Potential issue | 🟡 Minor: Avoid assuming `OwnerReferences[0]` is the IRI owner. Both line 160 and lines 624-627 rely on the first owner-reference entry. Ordering is not guaranteed, so this is brittle and can cause false negatives.

Proposed fix

```diff
-o.Expect(mc.OwnerReferences[0].Kind).Should(o.Equal("InternalReleaseImage"),
-	"MachineConfig %s should be owned by InternalReleaseImage", mcName)
+ownedByIRI := false
+for _, ownerRef := range mc.OwnerReferences {
+	if ownerRef.Kind == "InternalReleaseImage" && ownerRef.Name == IRIResourceName {
+		ownedByIRI = true
+		break
+	}
+}
+o.Expect(ownedByIRI).To(o.BeTrue(),
+	"MachineConfig %s should be owned by InternalReleaseImage", mcName)
@@
-for _, mc := range mcList.Items {
-	if len(mc.OwnerReferences) > 0 &&
-		mc.OwnerReferences[0].Kind == "InternalReleaseImage" &&
-		mc.OwnerReferences[0].Name == IRIResourceName {
-		iriMCs = append(iriMCs, mc.Name)
-	}
-}
+for _, mc := range mcList.Items {
+	for _, ownerRef := range mc.OwnerReferences {
+		if ownerRef.Kind == "InternalReleaseImage" && ownerRef.Name == IRIResourceName {
+			iriMCs = append(iriMCs, mc.Name)
+			break
+		}
+	}
+}
```

Also applies to: 623-628

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/internalreleaseimage/internalreleaseimage.go` around lines 158-161: the test is brittle because it assumes mc.OwnerReferences[0] is the InternalReleaseImage owner; update the assertions that reference mc.OwnerReferences (the checks around mc and mcName, and the similar block at lines ~623-628) to instead verify that at least one OwnerReference entry has Kind == "InternalReleaseImage" (and, if applicable, the expected Name/UID), i.e. scan mc.OwnerReferences for a matching entry rather than indexing [0].
pawanpinjarkar force-pushed from 2d74b43 to 23e3532 (Compare)
| e2e.Logf("Status validation complete: all %d releases are valid and healthy", len(iri.Status.Releases)) | ||
| }) | ||
|
|
||
| g.It("restore deleted MachineConfigs [Disruptive][apigroup:machineconfiguration.openshift.io]", func() { |
I don't think this is a Disruptive test, as it does not alter or reboot a node (at most it could be considered Serial, but I guess the code is resilient enough to remain a parallel one).
| e2e.Logf("Successfully verified restoration of all %d MachineConfigs", originalCount) | ||
| }) | ||
|
|
||
| g.It("prevent deletion when in use [Disruptive][apigroup:machineconfiguration.openshift.io]", func() { |
Definitely not a disruptive test
```go
mcClient := oc.MachineConfigurationClient().MachineconfigurationV1alpha1()

g.By("Verifying exactly one IRI resource exists cluster-wide")
iriList, err := mcClient.InternalReleaseImages().List(context.Background(), metav1.ListOptions{})
```
This seems redundant, because just a few lines below the IRI is retrieved by name (which is the correct thing to do).
```go
var oc = exutil.NewCLIWithoutNamespace("no-registry")

g.BeforeEach(func() {
	skipIfNoRegistryFeatureNotEnabled(oc)
```
This skip is fine (for now), but not enough. I think the skipping logic could be fine-tuned to avoid running the IRI tests against unsupported platforms or producing false negatives (see the sketch after this list):
- Once the feature is promoted to GA, the feature-gate skip should be removed.
- Currently, the feature is available only for the baremetal and none platform types (any other platform can safely be skipped).
- Having the feature enabled in general is not enough, because it must also be activated; in this case, activation is indicated by the presence of the IRI resource.
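One possible shape for that refined skip logic, as a minimal sketch only: it reuses the PR's `skipIfNoRegistryFeatureNotEnabled` helper and `IRIResourceName` constant, while the platform and activation checks (and the `configv1`/`apierrors` imports they need) are illustrative assumptions, not code from the PR:

```go
g.BeforeEach(func() {
	// 1. Feature-gate check; to be dropped once the feature reaches GA.
	skipIfNoRegistryFeatureNotEnabled(oc)

	// 2. The feature currently applies only to the baremetal and none platform types.
	infra, err := oc.AdminConfigClient().ConfigV1().Infrastructures().Get(
		context.Background(), "cluster", metav1.GetOptions{})
	o.Expect(err).NotTo(o.HaveOccurred())
	platform := configv1.PlatformType("")
	if infra.Status.PlatformStatus != nil {
		platform = infra.Status.PlatformStatus.Type
	}
	if platform != configv1.BareMetalPlatformType && platform != configv1.NonePlatformType {
		g.Skip(fmt.Sprintf("IRI tests are not supported on platform %s", platform))
	}

	// 3. Enabled is not enough; the feature must be activated, which is
	// signalled by the presence of the IRI resource.
	_, err = oc.MachineConfigurationClient().MachineconfigurationV1alpha1().
		InternalReleaseImages().Get(context.Background(), IRIResourceName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		g.Skip("InternalReleaseImage resource not present; feature not activated")
	}
	o.Expect(err).NotTo(o.HaveOccurred())
})
```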
```go
o.Expect(err).NotTo(o.HaveOccurred(), "Failed to list InternalReleaseImage resources")
o.Expect(iriList.Items).Should(o.HaveLen(1))

g.By("Getting the InternalReleaseImage resource")
```
That's OK, even though, as mentioned before, this check should be used in the skipping logic.
| e2e.Logf("IRI resource still exists as expected") | ||
| }) | ||
|
|
||
| g.It("maintain high availability of registry across master nodes [Disruptive][apigroup:machineconfiguration.openshift.io]", func() { |
I have some real concerns about having this test here (for the feature promotion); let's have a discussion about that.
```go
		len(masterIPs)-1, len(masterIPs))
})

g.It("allow workloads to pull images from internal registry [apigroup:machineconfiguration.openshift.io]", func() {
```
Another example of weird phrasing, given the previous Describe behavior (it's not the resource that allows that).
```go
o.Expect(iri.Status.Releases).ShouldNot(o.BeEmpty(), "IRI should have releases")

g.By("Selecting an image from the release payload")
releaseImage := iri.Status.Releases[0].Image
```
Probably any well-known payload image could work as well, but I guess it is not easy right now to retrieve its pullspec (it probably will be once https://issues.redhat.com/browse/AGENT-1312 is available; cc @danielerez).
```go
	return iriMCs, nil
}

func checkIRIRegistryAccessible(ip string) bool {
```
This is not going to work in the CI, since the port is not accessible
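If a direct probe has to stay, one CI-friendlier alternative could be to run the check on the node itself via `oc debug`, so the test runner never needs to reach port 22625 directly. A sketch under that assumption (the helper name is hypothetical, not code from the PR):

```go
// checkIRIRegistryAccessibleFromNode is a hypothetical replacement that probes
// the registry from the node itself instead of from the test runner.
func checkIRIRegistryAccessibleFromNode(oc *exutil.CLI, nodeName string) bool {
	out, err := oc.AsAdmin().Run("debug").Args(
		"node/"+nodeName, "--",
		"chroot", "/host",
		"curl", "-k", "-s", "-o", "/dev/null", "-w", "%{http_code}",
		"https://localhost:22625/v2/",
	).Output()
	if err != nil {
		e2e.Logf("Registry probe on node %s failed: %v", nodeName, err)
		return false
	}
	code := strings.TrimSpace(out)
	// Both 200 and 401 indicate the registry endpoint is up (see the earlier review note).
	return code == "200" || code == "401"
}
```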
| e2e.Logf("Workload successfully pulled and ran image from internal IRI registry") | ||
| }) | ||
|
|
||
| g.It("allow new worker nodes to connect to registry on port 22625 [Disruptive][Slow][apigroup:machineconfiguration.openshift.io]", func() { |
I don't think this test is going to work in CI, and I also found it overly complicated, so let's discuss; I think it should be removed.
WIP