Conversation

@damemi damemi commented Feb 28, 2025

The workload-lifecycle e2e can flake because it only runs the traffic generation job once and then loops checking for traces. If auto-instrumentation isn't actually running for a service before the job is generated, the loop is doomed to fail.

This changes how the workload-lifecycle e2e generates test traces and counts them:

  • Instead of creating the traffic job once, create it on every loop iteration that checks for traces (a rough sketch follows after this list)
  • When checking for traces, use a new jq filter to count the unique occurrences of service names
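
For illustration only, here is a minimal sketch of what the recreate-and-check loop could look like. The job name, manifest path, and check_traces helper are hypothetical placeholders, not the actual e2e files or scripts:

    # Hypothetical sketch (not the actual e2e code): recreate the traffic job on
    # every retry instead of only once up front, so services that finish getting
    # instrumented later still receive traffic.
    for attempt in $(seq 1 30); do
      kubectl delete job traffic-generator --ignore-not-found
      kubectl create -f traffic-generator-job.yaml
      if check_traces; then   # e.g. a thin wrapper around traceql_runner.sh
        exit 0
      fi
      sleep 10
    done
    exit 1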

Unlike the source e2e, which can just check for a minimum number of spans, we have to aggregate the unique services in these checks. That's because in the source e2e, all of the services are tied together in a single trace. In this test, the services are separate, so the loop could generate multiple traces for the same service while waiting for others to be ready. This means we can't rely on just checking a minimum span count, because we might hit that minimum even before all services have sent traces.
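
As a rough illustration of the counting step, a jq filter along these lines could aggregate unique service names rather than raw span counts. The response shape (.traces[].rootServiceName, as in a Tempo search result) and the EXPECTED_SERVICES variable are assumptions, not the actual traceql_runner.sh output:

    # Hypothetical sketch: count distinct services that have reported traces,
    # assuming each returned trace carries a service name field.
    unique_services=$(echo "$response" | jq '[.traces[].rootServiceName] | unique | length')

    # Pass only when every expected service has at least one trace, rather than
    # when some minimum total span count is reached.
    if [ "$unique_services" -ge "$EXPECTED_SERVICES" ]; then
      echo "all $EXPECTED_SERVICES services reported traces"
    fi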

We might eventually be able to drop the custom_jq field and just make that filter the default, but because other jobs might be using the traceql_runner.sh script, I'm not doing that yet, to avoid breaking anything else.

@damemi damemi force-pushed the workload-lifecycle-update branch from 3b9f7d9 to 01de4b4 on February 28, 2025 20:50
@damemi damemi changed the title Workload lifecycle update Refactor workload-lifecycle e2e to be more fault tolerant Feb 28, 2025
BenElferink previously approved these changes Mar 1, 2025

@BenElferink BenElferink left a comment

Very nice! 🏆
Do you think this loop can be implemented for cli-upgrade as well?

@RonFed RonFed commented Mar 1, 2025

The reason we generate the traffic only once is that (at least in theory) we should already have made all the necessary asserts to make sure that all the services are instrumented and Odigos is ready. If that is not the case (which is probably what happens today), it means we are either missing an assert or we have bugs.
Those asserts also reflect what we present to the users: if all the InstrumentationInstances are healthy etc., they should expect traces.
This change, although it reduces the flakiness, might hide some of the bugs from us IMO.

@RonFed RonFed left a comment

see my comment above

@BenElferink BenElferink dismissed their stale review March 2, 2025 07:52

removed approval due to requested changes

@github-actions

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label May 12, 2025
@github-actions

This PR was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this Jun 16, 2025