Conversation

@damemi damemi commented Feb 28, 2025

The workload-lifecycle e2e can flake because it only runs the traffic generation job once and then loops checking for traces. If auto-instrumentation isn't actually running for a service before the job is generated, the loop is doomed to fail.

This changes how the workload-lifecycle e2e generates test traces and counts them:

  • Instead of creating the traffic job once, create it on every loop iteration that checks for traces (a rough sketch follows after this list)
  • When checking for traces, use a new jq filter to count the unique occurrences of service names
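
For illustration only, here is a minimal sketch of what the recreate-and-check loop could look like. The job name, manifest path, and check_traces helper are hypothetical placeholders, not the actual e2e files or scripts:

    # Hypothetical sketch (not the actual e2e code): recreate the traffic job on
    # every retry instead of only once up front, so services that finish getting
    # instrumented later still receive traffic.
    for attempt in $(seq 1 30); do
      kubectl delete job traffic-generator --ignore-not-found
      kubectl create -f traffic-generator-job.yaml
      if check_traces; then   # e.g. a thin wrapper around traceql_runner.sh
        exit 0
      fi
      sleep 10
    done
    exit 1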

Unlike the source e2e, which can just check for a minimum number of spans, we have to aggregate the unique services in these checks. That's because in the source e2e, all of the services are tied together in a single trace. In this test, the services are separate, so the loop could generate multiple traces for the same service while waiting for others to be ready. This means we can't rely on just checking a minimum span count, because we might hit that minimum even before all services have sent traces.
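
As a rough illustration of the counting step, a jq filter along these lines could aggregate unique service names rather than raw span counts. The response shape (.traces[].rootServiceName, as in a Tempo search result) and the EXPECTED_SERVICES variable are assumptions, not the actual traceql_runner.sh output:

    # Hypothetical sketch: count distinct services that have reported traces,
    # assuming each returned trace carries a service name field.
    unique_services=$(echo "$response" | jq '[.traces[].rootServiceName] | unique | length')

    # Pass only when every expected service has at least one trace, rather than
    # when some minimum total span count is reached.
    if [ "$unique_services" -ge "$EXPECTED_SERVICES" ]; then
      echo "all $EXPECTED_SERVICES services reported traces"
    fi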

We might eventually be able to drop the custom_jq field and just make that filter the default, but because other jobs might be using the traceql_runner.sh script, I'm not doing that yet, to avoid breaking anything else.

@damemi damemi force-pushed the workload-lifecycle-update branch from 3b9f7d9 to 01de4b4 on February 28, 2025 20:50
@damemi damemi changed the title Workload lifecycle update Refactor workload-lifecycle e2e to be more fault tolerant Feb 28, 2025
BenElferink previously approved these changes Mar 1, 2025

@BenElferink BenElferink left a comment

Very nice! 🏆
Do you think this loop can be implemented for cli-upgrade as well?

@RonFed RonFed commented Mar 1, 2025

The reason we generate the traffic only once is that (at least in theory) we should already have made all the necessary asserts to make sure that all the services are instrumented and Odigos is ready. If that is not the case (which is probably what happens today), it means we are either missing an assert or we have bugs.
Those asserts also reflect what we present to the users: if all the InstrumentationInstances are healthy etc., they should expect traces.
This change, although it reduces the flakiness, might hide some of the bugs from us IMO.

@RonFed RonFed left a comment

see my comment above

@BenElferink BenElferink dismissed their stale review March 2, 2025 07:52

removed approval due to requested changes

@github-actions

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label May 12, 2025
@github-actions

This PR was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this Jun 16, 2025