
Conversation

@tmshort
Contributor

@tmshort commented Dec 10, 2025

For the upgrade e2e tests, don't assume there is only one replica. Get the number of replicas from the deployment and wait for the deployment to have that many available. Use the lease to determine the leader pod and reference that.

Note that the lease name formats for operator-controller and catalogd are quite different; this change doesn't touch that, as it may have an impact on the upgrade test itself.

Assisted-by: Claude Code
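
For illustration, lease-based leader detection along these lines could look like the following. This is a minimal sketch assuming a controller-runtime client; the helper name and arguments are made up, not taken from the PR:

import (
	"context"
	"fmt"
	"strings"

	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findLeaderPod is a hypothetical helper: it reads the leader-election Lease
// and matches its holder identity against the candidate pods.
func findLeaderPod(ctx context.Context, c client.Client, ns, leaseName string, pods []corev1.Pod) (*corev1.Pod, error) {
	lease := &coordinationv1.Lease{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: leaseName}, lease); err != nil {
		return nil, err
	}
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		return nil, fmt.Errorf("lease %s/%s has no holder", ns, leaseName)
	}
	// The holder identity is conventionally "<pod-name>_<random-id>";
	// the pod name is the part before the underscore.
	leaderName := strings.SplitN(*lease.Spec.HolderIdentity, "_", 2)[0]
	for i := range pods {
		if pods[i].Name == leaderName {
			return &pods[i], nil
		}
	}
	return nil, fmt.Errorf("leader pod %q not found", leaderName)
}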

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort requested a review from a team as a code owner December 10, 2025 21:23
@openshift-ci bot requested review from bentito and joelanford December 10, 2025 21:23
@netlify

netlify bot commented Dec 10, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 3823eb5
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/693af695027c7800089a19ce
😎 Deploy Preview https://deploy-preview-2379--olmv1.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov

codecov bot commented Dec 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.87%. Comparing base (e4a633e) to head (3823eb5).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2379      +/-   ##
==========================================
- Coverage   74.86%   72.87%   -1.99%     
==========================================
  Files          95       98       +3     
  Lines        7377     7581     +204     
==========================================
+ Hits         5523     5525       +2     
- Misses       1419     1622     +203     
+ Partials      435      434       -1     
Flag                Coverage Δ
e2e                 44.62% <ø> (+0.05%) ⬆️
experimental-e2e    49.32% <ø> (+0.07%) ⬆️
unit                56.82% <ø> (-1.58%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


"oper-curl-metrics",
metricsURL,
"app.kubernetes.io/name=operator-controller",
8443,
Contributor

should we move the ports to consts to improve readability? e.g.

const (
  operatorControllerMetricsPort = 8443
  catalogdMetricsPort = 7443
)

or something like that?

Contributor Author

I kinda wish they were the same value!

Contributor Author

But yeah, it looks as though they are already defined.

return &managerPods.Items[0]

// Find the leader pod by checking the lease
t.Log("Finding the leader pod")
Contributor

I wonder if we're carrying too much complexity here. It seems we're returning the pod for two reasons:

  1. To at some point check the logs for some specific log lines to do with leader election, e.g.

	t.Log("Waiting for acquired leader election")
	leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
	defer leaderCancel()
	leaderSubstrings := []string{"successfully acquired lease"}
	leaderElected, err := watchPodLogsForSubstring(leaderCtx, &managerPod, leaderSubstrings...)
	require.NoError(t, err)
	require.True(t, leaderElected)

	t.Log("Reading logs to make sure that ClusterCatalog was reconciled by catalogdv1")
	logCtx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	substrings := []string{
		"reconcile ending",
		fmt.Sprintf(`ClusterCatalog=%q`, testClusterCatalogName),
	}
	found, err := watchPodLogsForSubstring(logCtx, &managerPod, substrings...)
	require.NoError(t, err)
	require.True(t, found)

  2. Because there's a possible bug in catalogd where, after a pod restart, the ClusterCatalog still reports that it is serving the catalog. This causes some flakes by making the e2e test progress quickly to the Eventually waiting for installation, which times out because it's also implicitly waiting for the catalog to unpack.

I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.

I question whether we need 1 at all, and we could try to tackle 2 in a different way, e.g. by hitting the catalog service endpoint until it doesn't return a 404. wdyt?
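
For illustration, a minimal sketch of that 404-polling idea; the function name, URL handling, and e2e-only TLS shortcut are assumptions, not code from the PR:

import (
	"context"
	"crypto/tls"
	"net/http"
	"time"
)

// waitForCatalogServed is a hypothetical helper: it polls the catalogd
// service URL until the catalog content is actually served (no longer 404).
func waitForCatalogServed(ctx context.Context, url string) error {
	tlsCfg := &tls.Config{InsecureSkipVerify: true} // e2e-only shortcut; a real test should trust the serving CA
	httpClient := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsCfg}}
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := httpClient.Do(req)
			if err != nil {
				continue // the service may not be reachable yet
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusNotFound {
				return nil // catalog content is being served
			}
		}
	}
}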

Contributor Author

I wonder if changing the test that much is in scope here. This preserves the test as-is, while making sure the returned pod is correct.

Contributor Author

I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.

We are already checking updatedReplicas, replicas, availableReplicas and readyReplicas:

require.True(ct,
	managerDeployment.Status.UpdatedReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.Replicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.AvailableReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.ReadyReplicas == *managerDeployment.Spec.Replicas,
)
desiredNumReplicas = *managerDeployment.Spec.Replicas

Do we need to do more than that?

Which Available condition are you referring to? Deployment? ClusterCatalog?

Contributor

On the Deployment

Contributor

@perdasilva Dec 11, 2025

I wonder if changing the test that much is in scope here. This is preserving the test as-is, but making sure the returned pod is correct.

That's fair. I just don't know that we want to carry so much complexity, and it might be worth questioning the need for the return value and tidying things up. I think it's ok to move it over to another PR (or maybe the cucumber tests might already cover this). But I thought I'd call it out.

Contributor Author

The cucumber stuff will likely rewrite some of that, so I'd say defer it to that work, but I can add a quick check for the deployment Available status, since we're already checking things on the deployment.
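
For illustration, such a Deployment Available check could be a small helper like this; a sketch only, with a made-up name, meant to sit next to the replica checks quoted above:

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether the Deployment's Available
// condition is present and True.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// Usage alongside the existing assertions, e.g.:
//   require.True(ct, deploymentAvailable(managerDeployment))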

require.NoError(t, err, "Error calling metrics endpoint: %s", string(output))
require.Contains(t, string(output), "200 OK", "Metrics endpoint did not return 200 OK")
// Get all pod IPs for the component
podIPs := c.getComponentPodIPs(t)
Contributor

we could send requests to the pods using their FQDNs; no need to find out their IPs.

Contributor Author

@tmshort Dec 11, 2025

But it appears that the FQDN uses dashed IP addresses; the pods don't have a hostname/subdomain set.
EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Contributor

But it appears that the FQDN uses dashed IP addresses, the pods don't have a hostname/subdomain set. EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Ah, you are absolutely right.

nit: you can read all the IP addresses for a service (i.e. all pod IPs) by fetching the Endpoints resource (its name is the same as the service's).
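
For illustration, a minimal sketch of reading pod IPs from the Endpoints object, assuming a controller-runtime client (the helper name is made up):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podIPsFromEndpoints collects all pod IPs behind a Service by reading
// its Endpoints object, which shares the Service's name.
func podIPsFromEndpoints(ctx context.Context, c client.Client, ns, svcName string) ([]string, error) {
	eps := &corev1.Endpoints{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: svcName}, eps); err != nil {
		return nil, err
	}
	var ips []string
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			ips = append(ips, addr.IP)
		}
	}
	return ips, nil
}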

For the upgrade e2e tests, don't assume there is only one replica.
Get the number of replicas from the deployment and wait for the
deployment to have that many available. Use the lease to determine
the leader pod and reference that.

Note that the name format of leases for operator-controller and catalogd
are quite different; this doesn't change that, as it may have an impact
on the upgrade test itself.

Signed-off-by: Todd Short <tshort@redhat.com>
Assisted-by: Claude Code
Signed-off-by: Todd Short <tshort@redhat.com>
Contributor

@pedjak left a comment

/lgtm

@openshift-ci bot added the lgtm label Dec 11, 2025
@openshift-ci

openshift-ci bot commented Dec 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pedjak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Dec 11, 2025
@tmshort
Contributor Author

tmshort commented Dec 11, 2025

It appears there are GitHub API issues causing some of our test failures.

@tmshort
Contributor Author

tmshort commented Dec 11, 2025

/override codecov/project

@openshift-ci

openshift-ci bot commented Dec 11, 2025

@tmshort: Overrode contexts on behalf of tmshort: codecov/project

In response to this:

/override codecov/project

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot bot merged commit 6aa4040 into operator-framework:main Dec 11, 2025
30 of 34 checks passed
@tmshort deleted the deployments-e2e branch December 11, 2025 18:50