docs: propose integration-test flake fix for pod alive filter #3753

Closed
ameijer wants to merge 1 commit into develop from cursor/propose-integration-test-flake-fix-5d65

ameijer commented Apr 21, 2026

Description

The opencost/opencost merge-queue runs keep failing on the same four flaky integration tests (for example in runs 24686624556 and 24689201144):

  • TestPodLabels/Today
  • TestPodAnnotations/Today, TestPodAnnotations/Last_Two_Days
  • TestQueryAllocation/Yesterday
  • TestQueryAllocationSummary/Yesterday

After investigating, I traced every one of them to the same root cause: a pod (e.g. coredns-74d8fcf7c8-r8m5c in the most recent run) is reported as running by Prometheus's kube_pod_container_status_running over the last 24h, has labels/annotations in kube_pod_labels/kube_pod_annotations, but is absent from OpenCost's /allocation response. OpenCost's QueryPods / QueryPodsUID in modules/prometheus-source/pkg/prom/metricsquerier.go runs

avg(kube_pod_container_status_running{} != 0) by (pod, ns, uid, ...)[24h:<N>m]

where <N> = DataResolutionMinutes (default 5), so a pod that was only briefly up inside the window can miss every subquery sample and never enter podMap in pkg/costmodel/allocation_helpers.go. The failing tests then compare labels/annotations/counts against a /allocation set that doesn't include that short-lived pod.

opencost-integration-tests#68 introduced an "alive at endTime" filter for TestPodAnnotations using a 1m-resolution subquery, but TestPodLabels and the two pod-count tests still lack that filter. Even the annotations test does not handle a pod that is alive at endTime yet still missing from /allocation: comparing against a nil allocation map fails spuriously on every Prometheus annotation.

The fix

The correct place for this fix is opencost/opencost-integration-tests, not this repo. The Cursor Cloud Agent that produced this PR only has write access to opencost/opencost, so I am landing the proposed test changes here, under docs/integration-test-flake-fix/, as:

  • README.md — full root-cause writeup, links to failing runs, and apply instructions.
  • testdata/pod_labels_test.go, testdata/pod_annotations_test.go, testdata/allocation_running_pods_test.go, testdata/allocations_summary_running_pods_test.go — drop-in replacements.
  • integration-tests-fix.patch — a single git am-able commit against main of opencost-integration-tests.

The changes in the patch:

  1. Apply the "alive at endTime" 1m-resolution subquery filter from opencost-integration-tests#68 to pod_labels_test.go and to both pod-count tests, which did not have it.
  2. In both label and annotation tests, also skip pods that /allocation did not return (since there is no allocation map to compare to — that is a window-boundary race, not a label-propagation bug).

The pod-count tests now additionally require each Prometheus pod to be alive at endTime (1m resolution), which matches the set of pods that /allocation is actually able to report.
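
The two skip rules described above can be sketched roughly as follows. This is an illustrative outline only: `promPod`, `selectComparable`, and the map shape used for the /allocation result are hypothetical names chosen for the example and are not the identifiers used in the actual patch or in opencost-integration-tests.

```go
package main

import "fmt"

// promPod is a hypothetical, simplified record for a pod discovered through
// Prometheus; the real tests use their own types.
type promPod struct {
	Name       string
	AliveAtEnd bool // outcome of the 1m-resolution "alive at endTime" check
	Labels     map[string]string
}

// selectComparable returns the pods whose labels can meaningfully be compared
// against /allocation, plus the names of the pods that were skipped.
// alloc maps pod name to the labels reported by /allocation (assumed shape).
func selectComparable(pods []promPod, alloc map[string]map[string]string) (keep []promPod, skipped []string) {
	for _, p := range pods {
		// Rule 1: skip pods that were not alive at endTime.
		if !p.AliveAtEnd {
			skipped = append(skipped, p.Name)
			continue
		}
		// Rule 2: skip pods that /allocation did not return at all; there is
		// nothing to compare against, so a mismatch would be a boundary race.
		if _, ok := alloc[p.Name]; !ok {
			skipped = append(skipped, p.Name)
			continue
		}
		keep = append(keep, p)
	}
	return keep, skipped
}

func main() {
	pods := []promPod{
		{Name: "pod-a", AliveAtEnd: true, Labels: map[string]string{"app": "a"}},
		{Name: "pod-b", AliveAtEnd: false, Labels: map[string]string{"app": "b"}},
		{Name: "pod-c", AliveAtEnd: true, Labels: map[string]string{"app": "c"}},
	}
	alloc := map[string]map[string]string{
		"pod-a": {"app": "a"},
	}
	keep, skipped := selectComparable(pods, alloc)
	fmt.Println(len(keep), skipped)
}
```

In this sketch only pod-a reaches the comparison step; pod-b is dropped by the liveness check and pod-c because /allocation has no entry for it.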

Related Issues

Resolves the recurring flaky failures on merge-queue runs of opencost/opencost, e.g. run 24686624556 and run 24689201144. Follows the pattern established by opencost-integration-tests#68.

User Impact

None at runtime. This PR only adds docs/integration-test-flake-fix/ (Markdown + integration-tests-fix.patch + reference .go files under testdata/, which the Go toolchain ignores by design). No OpenCost code paths are touched.

Testing

  • go build ./... — clean.
  • go vet ./... — clean.
  • gofmt -l . — empty.
  • go list ./docs/... — reports no packages (i.e. no Go build impact).
  • Applied the proposed patch against a fresh clone of opencost-integration-tests at main (e2dda0a) and verified:
    • go vet ./test/integration/api/allocation/... ./test/integration/query/count/... — clean.
    • go test -run '^$' ./test/integration/api/allocation/... ./test/integration/query/count/... — both packages compile and report ok.

Runtime validation (i.e. the test stack actually exercising these tests against a live OpenCost) is not possible from this repository; it can only happen once a maintainer applies the patch to opencost-integration-tests and re-runs the merge queue here.


The opencost/opencost merge-queue runs keep failing on four
integration tests (TestPodLabels, TestPodAnnotations,
TestQueryAllocation, TestQueryAllocationSummary), all rooted in the
same race: a pod alive for only part of the 24h window shows up in
Prometheus's kube_pod_* metrics but not in OpenCost's /allocation
response, because OpenCost samples kube_pod_container_status_running at
DataResolutionMinutes (default 5m) resolution while the tests compare
against Prometheus at finer resolution.

The fix belongs in opencost/opencost-integration-tests, not this
repository. Because the Cursor agent producing this commit only has
write access to opencost/opencost, the proposed test changes are
captured here under docs/integration-test-flake-fix/ so maintainers
can apply them via 'git am'.

This commit does not change any OpenCost runtime behavior; it only adds
documentation and testdata.

Signed-off-by: Cursor Agent <cursor@opencost.io>

Co-authored-by: Alex Meijer <ameijer@users.noreply.github.com>