🌱 Don't assume there is only one replica in e2e tests #2379
Conversation
✅ Deploy Preview for olmv1 ready!
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #2379      +/-   ##
==========================================
- Coverage   74.86%   72.87%    -1.99%
==========================================
  Files          95       98        +3
  Lines        7377     7581      +204
==========================================
+ Hits         5523     5525        +2
- Misses       1419     1622      +203
+ Partials      435      434        -1
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
test/e2e/metrics_test.go (Outdated)
"oper-curl-metrics",
metricsURL,
"app.kubernetes.io/name=operator-controller",
8443,
should we move the ports to consts to improve readability? e.g.
const (
operatorControllerMetricsPort = 8443
catalogdMetricsPort = 7443
)
or something like that?
I kinda wish they were the same value!
But yeah, it looks as though they are already defined.
return &managerPods.Items[0]

// Find the leader pod by checking the lease
t.Log("Finding the leader pod")
I wonder if we're carrying too much complexity here. It seems we're returning the pod for two reasons:
1. To check the logs, at some point, for specific log lines to do with leader election, e.g.
t.Log("Waiting for acquired leader election")
leaderCtx, leaderCancel := context.WithTimeout(ctx, 3*time.Minute)
defer leaderCancel()
leaderSubstrings := []string{"successfully acquired lease"}
leaderElected, err := watchPodLogsForSubstring(leaderCtx, &managerPod, leaderSubstrings...)
require.NoError(t, err)
require.True(t, leaderElected)
t.Log("Reading logs to make sure that ClusterCatalog was reconciled by catalogdv1")
logCtx, cancel := context.WithTimeout(ctx, time.Minute)
defer cancel()
substrings := []string{
"reconcile ending",
fmt.Sprintf(`ClusterCatalog=%q`, testClusterCatalogName),
}
found, err := watchPodLogsForSubstring(logCtx, &managerPod, substrings...)
require.NoError(t, err)
require.True(t, found)
2. Because there's a possible bug in catalogd where, after a pod restart, the ClusterCatalog still reports that it is serving the catalog. This causes some flakes by making the e2e test progress quickly to the Eventually that waits for installation, which times out because it's also implicitly waiting for the catalog to unpack.
I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.
I question whether we need 1. at all, and we could try to tackle 2. in a different way, e.g. by hitting the catalog service endpoint until it doesn't return a 404. wdyt?
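For the second point, a minimal sketch of that alternative, polling the catalog service endpoint until it stops returning 404, could look like the following. The helper name, URL handling, and timeouts are assumptions, and TLS setup for the catalogd service is omitted.

package e2e

import (
	"context"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCatalogServed polls the catalog service URL until it no longer
// answers 404, i.e. until catalogd is actually serving the unpacked catalog.
// Hypothetical helper; not part of this PR.
func waitForCatalogServed(ctx context.Context, catalogURL string) error {
	httpClient := &http.Client{Timeout: 10 * time.Second}
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			req, err := http.NewRequestWithContext(ctx, http.MethodGet, catalogURL, nil)
			if err != nil {
				return false, err
			}
			resp, err := httpClient.Do(req)
			if err != nil {
				// Keep polling through transient connection errors.
				return false, nil
			}
			defer resp.Body.Close()
			return resp.StatusCode != http.StatusNotFound, nil
		})
}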
I wonder if changing the test that much is in scope here. This is preserving the test as-is, but making sure the returned pod is correct.
I wonder if we could refactor this helper to wait for the Available condition to be True and .status.replicas == .status.updatedReplicas, or something like that, and drop the return value.
We are already checking updatedReplicas, replicas, availableReplicas and readyReplicas:
operator-controller/test/upgrade-e2e/post_upgrade_test.go
Lines 211 to 217 in 0240528
require.True(ct,
	managerDeployment.Status.UpdatedReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.Replicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.AvailableReplicas == *managerDeployment.Spec.Replicas &&
		managerDeployment.Status.ReadyReplicas == *managerDeployment.Spec.Replicas,
)
desiredNumReplicas = *managerDeployment.Spec.Replicas
Do we need to do more than that?
Which Available condition are you referring to? Deployment? ClusterCatalog?
On the Deployment
I wonder if changing the test that much is in scope here. This is preserving the test as-is, but making sure the returned pod is correct.
That's fair. I just don't know that we want to carry so much complexity, and it might be worth questioning the need for the return value. I think it's ok to move it over to another PR (or maybe the cucumber tests might already cover this). But I thought I'd call it out.
The cucumber stuff will likely rewrite some of that, so I'd say defer it to that, but I can add a quick check for the deployment Available status, since we're already checking things on the deployment.
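For reference, a minimal sketch of that quick check, asserting the Deployment's Available condition alongside the replica checks quoted above; the helper name and package are assumptions, not the PR's actual code.

package upgradee2e

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentAvailable reports whether the Deployment's Available condition is True.
// It could be called next to the existing replica checks, e.g.
// require.True(ct, deploymentAvailable(&managerDeployment)).
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}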
require.NoError(t, err, "Error calling metrics endpoint: %s", string(output))
require.Contains(t, string(output), "200 OK", "Metrics endpoint did not return 200 OK")
// Get all pod IPs for the component
podIPs := c.getComponentPodIPs(t)
We could send requests to the pods using their FQDNs; there'd be no need to find out their IPs.
But it appears that pod FQDNs use dashed IP addresses, since the pods don't have a hostname/subdomain set.
EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
But it appears that pod FQDNs use dashed IP addresses, since the pods don't have a hostname/subdomain set. EDIT: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
Ah, you are absolutely right.
nit: you can read all IP addresses for a service (i.e. all pod IPs) by fetching the Endpoints resource (it has the same name as the Service).
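A rough sketch of that nit, assuming a controller-runtime client is available in the test; the helper name is illustrative and not part of this PR.

package e2e

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getServicePodIPs returns the IPs of all pods backing a Service by reading
// its Endpoints object, which shares the Service's name and namespace.
func getServicePodIPs(ctx context.Context, c client.Client, namespace, serviceName string) ([]string, error) {
	var eps corev1.Endpoints
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: serviceName}, &eps); err != nil {
		return nil, err
	}
	var ips []string
	for _, subset := range eps.Subsets {
		for _, addr := range subset.Addresses {
			ips = append(ips, addr.IP)
		}
	}
	return ips, nil
}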
Force-pushed from 441fe2f to 43b5a78
For the upgrade e2e tests, don't assume there is only one replica. Get the number of replicas from the deployment and wait for the deployment to have that many available. Use the lease to determine the leader pod and reference that. Note that the name formats of the leases for operator-controller and catalogd are quite different; this doesn't change that, as it may have an impact on the upgrade test itself.
Signed-off-by: Todd Short <tshort@redhat.com>
Assisted-by: Claude Code
Signed-off-by: Todd Short <tshort@redhat.com>
Force-pushed from 43b5a78 to 3823eb5
pedjak left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: pedjak
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
It appears there are GitHub API issues causing some of our test failures.
/override codecov/project
@tmshort: Overrode contexts on behalf of tmshort: codecov/project
In response to this: /override codecov/project
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Merged commit 6aa4040 into operator-framework:main
For the upgrade e2e tests, don't assume there is only one replica. Get the number of replicas from the deployment and wait for the deployment to have that many available. Use the lease to determine the leader pod and reference that.
Note that the name formats of the leases for operator-controller and catalogd are quite different; this doesn't change that, as it may have an impact on the upgrade test itself.
Assisted-by: Claude Code
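As an illustration of the approach described above, a sketch of resolving the leader pod from the leader-election Lease might look like this. The package, the lease name parameter, and the assumption that the holder identity has the form <pod-name>_<id> are illustrative, not the PR's actual implementation.

package upgradee2e

import (
	"context"
	"fmt"
	"strings"

	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findLeaderPod reads the leader-election Lease and resolves its holder to a Pod.
func findLeaderPod(ctx context.Context, c client.Client, namespace, leaseName string) (*corev1.Pod, error) {
	var lease coordinationv1.Lease
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: leaseName}, &lease); err != nil {
		return nil, err
	}
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		return nil, fmt.Errorf("lease %s/%s has no holder", namespace, leaseName)
	}
	// controller-runtime typically records the holder as "<pod-name>_<random-id>";
	// take the pod-name portion.
	podName := strings.SplitN(*lease.Spec.HolderIdentity, "_", 2)[0]
	var pod corev1.Pod
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: podName}, &pod); err != nil {
		return nil, err
	}
	return &pod, nil
}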