
Conversation

@tmshort
Contributor

@tmshort commented Dec 10, 2025

For the upgrade e2e tests, don't assume there is only one replica. Get the number of replicas from the deployment and wait for the deployment to have that many available. Use the lease to determine the leader pod and reference that.

Note that the lease name formats for operator-controller and catalogd are quite different; this change doesn't touch that, as it may have an impact on the upgrade test itself.

Assisted-by: Claude Code
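
For illustration, lease-based leader detection along these lines could look like the following. This is a minimal sketch assuming a controller-runtime client; the helper name and arguments are made up, not taken from the PR:

import (
	"context"
	"fmt"
	"strings"

	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findLeaderPod is a hypothetical helper: it reads the leader-election Lease
// and matches its holder identity against the candidate pods.
func findLeaderPod(ctx context.Context, c client.Client, ns, leaseName string, pods []corev1.Pod) (*corev1.Pod, error) {
	lease := &coordinationv1.Lease{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: leaseName}, lease); err != nil {
		return nil, err
	}
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		return nil, fmt.Errorf("lease %s/%s has no holder", ns, leaseName)
	}
	// The holder identity is conventionally "<pod-name>_<random-id>";
	// the pod name is the part before the underscore.
	leaderName := strings.SplitN(*lease.Spec.HolderIdentity, "_", 2)[0]
	for i := range pods {
		if pods[i].Name == leaderName {
			return &pods[i], nil
		}
	}
	return nil, fmt.Errorf("leader pod %q not found", leaderName)
}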

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort requested a review from a team as a code owner December 10, 2025 21:23
@openshift-ci bot requested review from bentito and joelanford December 10, 2025 21:23
@netlify

netlify bot commented Dec 10, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 3823eb5
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/693af695027c7800089a19ce
😎 Deploy Preview https://deploy-preview-2379--olmv1.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

@codecov

codecov bot commented Dec 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.87%. Comparing base (e4a633e) to head (3823eb5).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2379      +/-   ##
==========================================
- Coverage   74.86%   72.87%   -1.99%     
==========================================
  Files          95       98       +3     
  Lines        7377     7581     +204     
==========================================
+ Hits         5523     5525       +2     
- Misses       1419     1622     +203     
+ Partials      435      434       -1     
Flag                Coverage Δ
e2e                 44.62% <ø> (+0.05%) ⬆️
experimental-e2e    49.32% <ø> (+0.07%) ⬆️
unit                56.82% <ø> (-1.58%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


"oper-curl-metrics",
metricsURL,
"app.kubernetes.io/name=operator-controller",
8443,
Contributor

should we move the ports to consts to improve readability? e.g.

const (
  operatorControllerMetricsPort = 8443
  catalogdMetricsPort = 7443
)

or something like that?

Contributor Author

I kinda wish they were the same value!

Contributor Author

But yeah, it looks as though they are already defined.

return &managerPods.Items[0]

// Find the leader pod by checking the lease
t.Log("Finding the leader pod")
Contributor

I wonder if we're carrying too much complexity here. It seems we're returning the pod for two reasons:

  1. To at some point check the logs for some specific log lines to do with leader election, e.g.

	t.Log("Waiting for acquired leader election")
	leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
	defer leaderCancel()
	leaderSubstrings := []string{"successfully acquired lease"}
	leaderElected, err := watchPodLogsForSubstring(leaderCtx, &managerPod, leaderSubstrings...)
	require.NoError(t, err)
	require.True(t, leaderElected)

	t.Log("Reading logs to make sure that ClusterCatalog was reconciled by catalogdv1")
	logCtx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	substrings := []string{
		"reconcile ending",
		fmt.Sprintf(`ClusterCatalog=%q`, testClusterCatalogName),
	}
	found, err := watchPodLogsForSubstring(logCtx, &managerPod, substrings...)
	require.NoError(t, err)
	require.True(t, found)

  2. Because there's a possible bug in catalogd where, after a pod restart, the ClusterCatalog still reports that it is serving the catalog. This causes some flakes by making the e2e test progress quickly to the Eventually waiting for installation, which times out because it's also implicitly waiting for the catalog to unpack.

I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.

I question whether we need 1 at all, and we could try to tackle 2 in a different way, e.g. by hitting the catalog service endpoint until it doesn't return a 404. wdyt?
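
For illustration, a minimal sketch of that 404-polling idea; the function name, URL handling, and e2e-only TLS shortcut are assumptions, not code from the PR:

import (
	"context"
	"crypto/tls"
	"net/http"
	"time"
)

// waitForCatalogServed is a hypothetical helper: it polls the catalogd
// service URL until the catalog content is actually served (no longer 404).
func waitForCatalogServed(ctx context.Context, url string) error {
	tlsCfg := &tls.Config{InsecureSkipVerify: true} // e2e-only shortcut; a real test should trust the serving CA
	httpClient := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsCfg}}
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
			if err != nil {
				return err
			}
			resp, err := httpClient.Do(req)
			if err != nil {
				continue // the service may not be reachable yet
			}
			resp.Body.Close()
			if resp.StatusCode != http.StatusNotFound {
				return nil // catalog content is being served
			}
		}
	}
}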

Contributor Author

I wonder if changing the test that much is in scope here. This preserves the test as-is, while making sure the returned pod is correct.

Contributor Author

I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.

We are already checking updatedReplicas, replicas, availableReplicas and readyReplicas:

require.True(ct,
	managerDeployment.Status.UpdatedReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.Replicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.AvailableReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.ReadyReplicas == *managerDeployment.Spec.Replicas,
)
desiredNumReplicas = *managerDeployment.Spec.Replicas

Do we need to do more than that?

Which Available condition are you referring to? Deployment? ClusterCatalog?

Contributor

On the Deployment

Contributor

@perdasilva Dec 11, 2025

I wonder if changing the test that much is in scope here. This is preserving the test as-is, but making sure the returned pod is correct.

That's fair. I just don't know that we want to carry so much complexity, and it might be worth questioning the need for the return value and tidying things up. I think it's ok to move it over to another PR (or maybe the cucumber tests might already cover this). But I thought I'd call it out.

Contributor Author

The cucumber stuff will likely rewrite some of that, so I'd say defer it to that work, but I can add a quick check for the deployment Available status, since we're already checking things on the deployment.
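
For illustration, such a Deployment Available check could be a small helper like this; a sketch only, with a made-up name, meant to sit next to the replica checks quoted above:

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether the Deployment's Available
// condition is present and True.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// Usage alongside the existing assertions, e.g.:
//   require.True(ct, deploymentAvailable(managerDeployment))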

require.NoError(t, err, "Error calling metrics endpoint: %s", string(output))
require.Contains(t, string(output), "200 OK", "Metrics endpoint did not return 200 OK")
// Get all pod IPs for the component
podIPs := c.getComponentPodIPs(t)
Contributor

we could send requests to the pods using their FQDNs; no need to find out their IPs.

Contributor Author

@tmshort Dec 11, 2025

But it appears that the FQDN uses dashed IP addresses; the pods don't have a hostname/subdomain set.
EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Contributor

But it appears that the FQDN uses dashed IP addresses, the pods don't have a hostname/subdomain set. EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Ah, you are absolutely right.

nit: you can read all the IP addresses for a service (i.e. all pod IPs) by fetching the Endpoints resource (its name is the same as the service's).
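
For illustration, a minimal sketch of reading pod IPs from the Endpoints object, assuming a controller-runtime client (the helper name is made up):

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// podIPsFromEndpoints collects all pod IPs behind a Service by reading
// its Endpoints object, which shares the Service's name.
func podIPsFromEndpoints(ctx context.Context, c client.Client, ns, svcName string) ([]string, error) {
	eps := &corev1.Endpoints{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: ns, Name: svcName}, eps); err != nil {
		return nil, err
	}
	var ips []string
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			ips = append(ips, addr.IP)
		}
	}
	return ips, nil
}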

For the upgrade e2e tests, don't assume there is only one replica.
Get the number of replicas from the deployment and wait for the
deployment to have that many available. Use the lease to determine
the leader pod and reference that.

Note that the name format of leases for operator-controller and catalogd
are quite different; this doesn't change that, as it may have an impact
on the upgrade test itself.

Signed-off-by: Todd Short <tshort@redhat.com>
Assisted-by: Claude Code
Signed-off-by: Todd Short <tshort@redhat.com>
Contributor

@pedjak left a comment

/lgtm

@openshift-ci bot added the lgtm label Dec 11, 2025
@openshift-ci

openshift-ci bot commented Dec 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pedjak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Dec 11, 2025
@tmshort
Contributor Author

tmshort commented Dec 11, 2025

It appears there are GitHub API issues causing some of our test failures.

@tmshort
Contributor Author

tmshort commented Dec 11, 2025

/override codecov/project

@openshift-ci

openshift-ci bot commented Dec 11, 2025

@tmshort: Overrode contexts on behalf of tmshort: codecov/project

In response to this:

/override codecov/project

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot bot merged commit 6aa4040 into operator-framework:main Dec 11, 2025
30 of 34 checks passed
@tmshort deleted the deployments-e2e branch December 11, 2025 18:50