docs: propose integration-test flake fix for pod alive filter #3753
Closed
ameijer wants to merge 1 commit into
The opencost/opencost merge-queue runs keep failing on four integration tests (TestPodLabels, TestPodAnnotations, TestQueryAllocation, TestQueryAllocationSummary), all rooted in the same race: a pod alive for only part of the 24h window shows up in Prometheus's kube_pod_* metrics but not in OpenCost's /allocation response, because OpenCost samples kube_pod_container_status_running at DataResolutionMinutes (default 5m) resolution while the tests compare against Prometheus at finer resolution.

The fix belongs in opencost/opencost-integration-tests, not this repository. Because the Cursor agent producing this commit only has write access to opencost/opencost, the proposed test changes are captured here under docs/integration-test-flake-fix/ so maintainers can apply them via 'git am'.

This commit does not change any OpenCost runtime behavior; it only adds documentation and testdata.

Signed-off-by: Cursor Agent <cursor@opencost.io>
Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>



Description
The `opencost/opencost` merge-queue runs keep failing on the same four flaky integration tests (for example in runs 24686624556 and 24689201144):

- `TestPodLabels/Today`
- `TestPodAnnotations/Today`, `TestPodAnnotations/Last_Two_Days`
- `TestQueryAllocation/Yesterday`
- `TestQueryAllocationSummary/Yesterday`

After investigating, I traced every one of them to the same root cause: a pod (e.g. `coredns-74d8fcf7c8-r8m5c` in the most recent run) is reported as running by Prometheus's `kube_pod_container_status_running` over the last 24h and has labels/annotations in `kube_pod_labels`/`kube_pod_annotations`, but is absent from OpenCost's `/allocation` response. OpenCost's `QueryPods`/`QueryPodsUID` in `modules/prometheus-source/pkg/prom/metricsquerier.go` runs a subquery over `kube_pod_container_status_running` at `<N>`-minute resolution, where `<N>` = `DataResolutionMinutes` (default 5), so a pod that was only briefly up inside the window can miss every subquery sample and never enter `podMap` in `pkg/costmodel/allocation_helpers.go`. The failing tests then compare labels/annotations/counts against a `/allocation` set that doesn't include that short-lived pod.

opencost-integration-tests#68 introduced an "alive at `endTime`" filter for `TestPodAnnotations` using a 1m-resolution subquery, but the same filter is still missing from `TestPodLabels` and the two pod-count tests. Even the annotations test does not handle the case where a pod is alive at `endTime` yet still missing from `/allocation` (comparing against nil produces a false negative on every Prometheus annotation).

The fix
The correct place for this fix is `opencost/opencost-integration-tests`, not this repo. The Cursor Cloud Agent that produced this PR only has write access to `opencost/opencost`, so I am landing the proposed test changes here, under `docs/integration-test-flake-fix/`, as:

- `README.md`: full root-cause writeup, links to failing runs, and apply instructions.
- `testdata/pod_labels_test.go`, `testdata/pod_annotations_test.go`, `testdata/allocation_running_pods_test.go`, `testdata/allocations_summary_running_pods_test.go`: drop-in replacements.
- `integration-tests-fix.patch`: a single `git am`-able commit against `main` of `opencost-integration-tests`.

The changes in the patch:

- Add the "alive at `endTime`" 1m-resolution subquery filter to `pod_labels_test.go` and to both pod-count tests, which did not have it.
- Skip pods that `/allocation` did not return (since there is no allocation map to compare to; that is a window-boundary race, not a label-propagation bug).

The pod-count tests now additionally require each Prometheus pod to be alive at `endTime` (1m resolution), which matches the set of pods that `/allocation` is actually able to report.

Related Issues
Resolves the recurring flaky failures on merge-queue runs of `opencost/opencost`, e.g. run 24686624556 and run 24689201144. Follows the pattern established by opencost-integration-tests#68.

User Impact
None at runtime. This PR only adds `docs/integration-test-flake-fix/` (Markdown + `integration-tests-fix.patch` + reference `.go` files under `testdata/`, which the Go toolchain ignores by design). No OpenCost code paths are touched.

Testing
- `go build ./...`: clean.
- `go vet ./...`: clean.
- `gofmt -l .`: empty.
- `go list ./docs/...`: reports no packages (i.e. no Go build impact).
- Applied the patch to `opencost-integration-tests` at `main` (`e2dda0a`) and verified:
  - `go vet ./test/integration/api/allocation/... ./test/integration/query/count/...`: clean.
  - `go test -run '^$' ./test/integration/api/allocation/... ./test/integration/query/count/...`: both packages compile and report `ok`.

Runtime validation (i.e. the test stack actually exercising these tests against a live OpenCost) is not possible from this repository; it can only happen once a maintainer applies the patch to `opencost-integration-tests` and re-runs the merge queue here.
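For reviewers who want a feel for the filtering rule described under "The fix", here is a minimal sketch. It is not the code in `integration-tests-fix.patch`: the function `comparablePods`, the map-based inputs, and the pod names are all hypothetical, and the real tests build these sets from Prometheus and `/allocation` responses. The sketch only shows the intended decision: compare a Prometheus pod against `/allocation` when it is alive at `endTime` and present in the allocation response, and otherwise set it aside.

```go
package main

import (
	"fmt"
	"sort"
)

// comparablePods splits the Prometheus pod list into pods that should be
// compared against /allocation (alive at endTime and present in the
// allocation response) and pods that should be skipped.
// All inputs are hypothetical stand-ins for the real query results.
func comparablePods(promPods []string, aliveAtEnd, inAllocation map[string]bool) (compare, skipped []string) {
	for _, pod := range promPods {
		if aliveAtEnd[pod] && inAllocation[pod] {
			compare = append(compare, pod)
		} else {
			skipped = append(skipped, pod)
		}
	}
	// Sort for stable, readable output in test logs.
	sort.Strings(compare)
	sort.Strings(skipped)
	return compare, skipped
}

func main() {
	promPods := []string{"api-0", "coredns-short-lived", "worker-1"}
	aliveAtEnd := map[string]bool{"api-0": true, "worker-1": true}
	inAllocation := map[string]bool{"api-0": true}

	compare, skipped := comparablePods(promPods, aliveAtEnd, inAllocation)
	fmt.Println("compare:", compare) // compare: [api-0]
	fmt.Println("skipped:", skipped) // skipped: [coredns-short-lived worker-1]
}
```

Returning the skipped pods rather than dropping them silently lets a test log them, so a window-boundary race stays visible without failing the run.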