🐛 Fix race condition in e2e code coverage collection #2644
Conversation
kubectl scale --replicas=0 is non-blocking and returns as soon as the API server accepts the change, not when pods have terminated. The existing wait on the copy pod was a no-op since it was already running. This meant kubectl cp could run before manager pods had terminated and flushed coverage data to the PVC. Wait for each deployment's .status.replicas to reach 0 before copying, ensuring the Go coverage runtime has written its data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Fixes intermittent e2e coverage loss by ensuring manager deployments have fully scaled down (and thus flushed coverage data to the PVC) before kubectl cp runs.
Changes:
- Replaces “wait for copy pod ready” with waits for both manager deployments’ replica counts to reach 0.
- Adds explicit timeouts to the waits to avoid hanging indefinitely.
hack/test/e2e-coverage.sh
Outdated
kubectl -n "$OPERATOR_CONTROLLER_NAMESPACE" wait --for=jsonpath='{.status.replicas}'=0 deployment/"$OPERATOR_CONTROLLER_MANAGER_DEPLOYMENT_NAME" --timeout=60s
kubectl -n "$CATALOGD_NAMESPACE" wait --for=jsonpath='{.status.replicas}'=0 deployment/"$CATALOGD_MANAGER_DEPLOYMENT_NAME" --timeout=60s
DeploymentStatus.replicas is an optional (omitempty) field and may be absent when it is 0. In that case, kubectl wait --for=jsonpath='{.status.replicas}'=0 may never match (empty string != 0) and can time out intermittently. A more robust approach is to wait for the underlying pods to be deleted (e.g., kubectl wait --for=delete pod -l <selector> --timeout=...) or implement a small polling loop that treats empty .status.replicas as 0.
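A minimal sketch of the polling alternative the comment suggests (the helper names and the 60-second deadline are illustrative assumptions, not code from the PR): treat an empty `.status.replicas` as 0 before comparing.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; function names and the deadline are
# assumptions, not code from the PR.

# DeploymentStatus.replicas is omitempty, so an absent field (empty
# jsonpath output) must be treated the same as an explicit 0.
replicas_is_zero() {
  [ -z "$1" ] || [ "$1" = "0" ]
}

# Poll the deployment until its replica count reads as zero or the
# deadline passes.
wait_for_scale_down() {
  local ns="$1" deploy="$2" deadline=$((SECONDS + 60))
  while [ "$SECONDS" -lt "$deadline" ]; do
    local count
    count="$(kubectl -n "$ns" get deployment "$deploy" \
      -o jsonpath='{.status.replicas}')"
    if replicas_is_zero "$count"; then
      return 0
    fi
    sleep 2
  done
  echo "timed out waiting for deployment/$deploy to scale down" >&2
  return 1
}
```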
hack/test/e2e-coverage.sh
Outdated
kubectl -n "$OPERATOR_CONTROLLER_NAMESPACE" wait --for=jsonpath='{.status.replicas}'=0 deployment/"$OPERATOR_CONTROLLER_MANAGER_DEPLOYMENT_NAME" --timeout=60s
kubectl -n "$CATALOGD_NAMESPACE" wait --for=jsonpath='{.status.replicas}'=0 deployment/"$CATALOGD_MANAGER_DEPLOYMENT_NAME" --timeout=60s
Hard-coding --timeout=60s risks CI flakes on slower clusters or during API server pressure (scale-down and termination can exceed 60s). Consider making the timeout configurable (env var with a sensible default) and/or increasing it to a more conservative value to reduce intermittent failures.
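One way to implement the comment's suggestion (the `E2E_SCALEDOWN_TIMEOUT` variable name and the 180s default are assumptions, not part of the PR):

```shell
# Hypothetical env-var override with a conservative default; the
# variable name is illustrative, not from the PR.
E2E_SCALEDOWN_TIMEOUT="${E2E_SCALEDOWN_TIMEOUT:-180s}"
echo "waiting up to $E2E_SCALEDOWN_TIMEOUT for managers to scale down"

# Guarded so the sketch stays runnable without a cluster.
if command -v kubectl >/dev/null 2>&1 && [ -n "${CATALOGD_NAMESPACE:-}" ]; then
  kubectl -n "$CATALOGD_NAMESPACE" wait \
    --for=jsonpath='{.status.replicas}'=0 \
    deployment/"$CATALOGD_MANAGER_DEPLOYMENT_NAME" \
    --timeout="$E2E_SCALEDOWN_TIMEOUT"
fi
```

CI can then export a longer value for known-slow clusters without editing the script.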
kubectl scale --replicas=0 is non-blocking and returns as soon as the API server accepts the change, not when pods have terminated. The existing wait on the copy pod was a no-op since it was already running. This meant kubectl cp could run before manager pods had terminated and flushed coverage data to the PVC. Wait for manager pods to be deleted before copying, ensuring the Go coverage runtime has written its data on process exit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 878686f to df9cdbe.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #2644      +/-   ##
==========================================
+ Coverage   68.92%   68.94%   +0.02%
==========================================
  Files         140      140
  Lines        9905     9905
==========================================
+ Hits         6827     6829       +2
+ Misses       2566     2565       -1
+ Partials      512      511       -1

Flags with carried forward coverage won't be shown.
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: camilamacedo86
Merged c641e2f into operator-framework:main
Description
hack/test/e2e-coverage.sh has a race condition that causes intermittent coverage data loss in CI.

kubectl scale --replicas=0 is non-blocking: it returns as soon as the API server accepts the change, not when pods have terminated. The existing wait --for=condition=ready on the copy pod was a no-op since it was already running. This meant kubectl cp could execute before manager pods had terminated and flushed coverage data to the PVC.

The fix replaces the no-op wait with kubectl wait --for=delete on the manager pods, ensuring they have fully terminated and the Go coverage runtime has written its data before copying.
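The fixed sequence can be sketched as follows (the helper name, label-selector argument, and 120s timeout are illustrative; the actual script uses its own names):

```shell
# Illustrative sketch of the fixed flow; names are assumptions,
# not code from the PR.
scale_down_and_wait() {
  local ns="$1" deploy="$2" selector="$3"
  # Non-blocking: returns once the API server accepts the change.
  kubectl -n "$ns" scale --replicas=0 deployment/"$deploy"
  # Blocks until the matching pods are actually gone, so the Go
  # coverage runtime has flushed its data on process exit.
  kubectl -n "$ns" wait --for=delete pod -l "$selector" --timeout=120s
}
```

Because `--for=delete` also succeeds when nothing matches the selector, this wait cannot hang on pods that have already terminated.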
Reviewer Checklist