
[processor/groupbytrace] Deadlock when eventMachineWorker's events queue is full #33719

Open · vijayaggarwal opened this issue Jun 22, 2024 · 6 comments
Labels: bug (Something isn't working), processor/groupbytrace (Group By Trace processor)

Comments

@vijayaggarwal

Component(s)

processor/groupbytrace

What happened?

Description

I run this processor with num_traces=100_000 (default is 1_000_000) to reduce the worst case memory requirement of this component.

Also, I run this processor with wait_duration=10s (default is 1s) so that the processor can handle longer gaps between receiving the spans of a trace.

Lastly, I use num_workers=1, which is also the default.

If the component gets a burst of traffic and more than num_traces=100_000 traces enter the processor in a short span of time, the ringBuffer fills up and evictions start. It is also quite likely that the eventMachineWorker.events channel's buffer (capacity = 10_000/num_workers) fills up with traceReceived events.

Now, when both the ring buffer and the events buffer are full, processing a traceReceived event triggers an eviction and therefore fires a traceRemoved event. The worker blocks while pushing that traceRemoved event onto its own full events channel, and since the worker is the channel's only consumer, the channel is never drained: the processor deadlocks.
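
To make the failure mode concrete, here is a minimal, self-contained Go sketch that reproduces the same pattern outside the collector. It is not the processor's actual code; the event type, channel capacity, and eviction logic are simplified stand-ins.

```go
package main

import (
	"fmt"
	"time"
)

// event is a simplified stand-in for the processor's internal event type.
type event struct{ typ string }

func main() {
	// Tiny buffer standing in for eventMachineWorker.events
	// (capacity 10_000/num_workers in the real processor).
	events := make(chan event, 2)
	events <- event{typ: "traceReceived"}
	events <- event{typ: "traceReceived"}

	// Incoming spans keep topping the queue up, as during a traffic burst.
	go func() {
		for {
			events <- event{typ: "traceReceived"}
		}
	}()

	done := make(chan struct{})
	go func() { // the single worker: the only consumer of `events`
		defer close(done)
		for e := range events {
			if e.typ == "traceReceived" {
				// The ring buffer is full, so this event triggers an eviction,
				// which fires traceRemoved back into the SAME channel. The send
				// blocks: the channel is full and its only consumer is this goroutine.
				events <- event{typ: "traceRemoved"}
			}
		}
	}()

	select {
	case <-done:
		fmt.Println("worker finished")
	case <-time.After(2 * time.Second):
		fmt.Println("stuck: worker blocked sending traceRemoved into its own full queue")
	}
}
```

The essential property is that the worker is both the sole consumer of the channel and, on eviction, a producer into it; once external producers keep the buffer full, its own send can never complete.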

Expected Result

The processor should not get deadlocked.

Actual Result

The processor gets deadlocked.

Collector version

0.95.0, but the issue affects all recent versions of the collector

Environment information

The issue is environment agnostic

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@vijayaggarwal vijayaggarwal added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jun 22, 2024
@github-actions github-actions bot added the processor/groupbytrace (Group By Trace processor) label on Jun 22, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling jpkrohling removed the needs triage (New item requiring triage) label on Jul 4, 2024
@jpkrohling
Member

@vijayaggarwal , I can totally see what you mean, but would you please create a test case for this? Perhaps we should split the state machine to deal with deletion events separately?

@vijayaggarwal
Author

@jpkrohling I can't think of how to capture this in a unit test case as there are multiple functions involved and the initial condition is also non-trivial. If you can point me to some other test case which is even partly similar to this one, I will be happy to give it a shot.

That said, here's a simple description:

Given:
eventMachineWorker.events is full, and eventMachineWorker.buffer is also full.

When:
eventMachineWorker calls eventMachine.handleEvent with e.typ == traceReceived

Then:
eventMachineWorker ends up firing a traceRemoved event (in the groupByTraceProcessor.onTraceReceived function). Firing an event is a blocking operation, so the eventMachineWorker stalls: eventMachineWorker.events is full, and its only consumer is now blocked producing into it.
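
If it helps, here is one possible shape for such a test, sketched against a simplified stand-in rather than the real eventMachineWorker API (the channel and the goroutine body below are placeholders for where the real handler would be invoked). The idea is just to bound the assertion with a deadline: with the current behaviour the handler blocks and the test times out; with either of the approaches discussed below it would return promptly.

```go
package groupbytraceprocessor

import (
	"testing"
	"time"
)

// Hypothetical test shape: fill the worker's event queue, drive the
// traceReceived-with-eviction path, and require that the handler returns
// within a deadline instead of blocking on its own queue.
func TestTraceReceivedDoesNotBlockOnFullQueue(t *testing.T) {
	events := make(chan string, 1) // stand-in for a full eventMachineWorker.events
	events <- "traceReceived"

	returned := make(chan struct{})
	go func() {
		defer close(returned)
		// Placeholder for calling onTraceReceived with a full ring buffer; in the
		// current implementation this fires traceRemoved into the same full channel.
		events <- "traceRemoved"
	}()

	select {
	case <-returned:
		// desired: the handler finished without waiting for its own queue to drain
	case <-time.After(500 * time.Millisecond):
		t.Fatal("handler blocked producing into the worker's own events queue")
	}
}
```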

Fundamentally speaking, I think the eventMachineWorker should not produce events into eventMachineWorker.events, as its primary role is to consume, not to produce. To maintain a clear separation of roles, we can either split the state machine as you suggested or execute the onTraceRemoved logic synchronously in onTraceReceived itself. I have a slight preference for the second approach, as I feel it will lead to simpler code.
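
To illustrate what I mean by the second approach, here is a rough sketch with simplified stand-in types (these are not the processor's real types or signatures):

```go
package main

import "fmt"

// Simplified stand-ins (not the real processor types) to contrast the two paths.
type trace struct{ id string }

type worker struct {
	events    chan string      // the worker's own event queue (untouched in this approach)
	buffer    map[string]trace // stand-in for the fixed-capacity ring buffer
	maxTraces int
}

// putWithEviction inserts a trace and returns the trace it had to evict, if any.
func (w *worker) putWithEviction(t trace) (trace, bool) {
	var evicted trace
	var didEvict bool
	if len(w.buffer) >= w.maxTraces {
		for id, old := range w.buffer { // evict an arbitrary entry in this sketch
			evicted, didEvict = old, true
			delete(w.buffer, id)
			break
		}
	}
	w.buffer[t.id] = t
	return evicted, didEvict
}

// onTraceReceivedSync handles an eviction inline instead of firing traceRemoved
// back into w.events, so the worker never produces into the queue it drains.
func (w *worker) onTraceReceivedSync(t trace) {
	if evicted, ok := w.putWithEviction(t); ok {
		onTraceRemoved(evicted) // synchronous call, no self-send
	}
}

func onTraceRemoved(t trace) {
	fmt.Println("released evicted trace", t.id)
}

func main() {
	w := &worker{events: make(chan string, 1), buffer: map[string]trace{}, maxTraces: 1}
	w.onTraceReceivedSync(trace{id: "a"})
	w.onTraceReceivedSync(trace{id: "b"}) // evicts "a" without touching w.events
}
```

The point is only that the eviction is resolved before onTraceReceived returns, so the worker never has to wait on the queue it is responsible for draining.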

PS: There's one bit of detail I skipped above for simplicity: the firing of the event happens within doWithTimeout, so there isn't a permanent deadlock in the strict sense. The worker times out after 1s and proceeds with the next operation. However, processing one event per second is far too slow for the worker to keep up with the practical traffic volumes the processor receives. [This doWithTimeout is placed in handleEventWithObservability; I had overlooked it earlier as I didn't expect non-observability logic there.]
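
For completeness, here is a minimal illustration of why a 1s bound turns the deadlock into roughly one event per second. This only sketches a bounded send with the same observable behaviour; it is not the actual doWithTimeout implementation.

```go
package main

import (
	"fmt"
	"time"
)

// sendWithTimeout gives up on a blocked send after `timeout`. It mimics the
// behaviour described above, not the real doWithTimeout helper.
func sendWithTimeout(events chan string, e string, timeout time.Duration) bool {
	select {
	case events <- e:
		return true
	case <-time.After(timeout):
		return false // the queue stayed full for the whole timeout
	}
}

func main() {
	events := make(chan string, 1)
	events <- "traceReceived" // the queue is already full and nobody is draining it

	start := time.Now()
	ok := sendWithTimeout(events, "traceRemoved", time.Second)
	// Every such send burns the full timeout, so the worker degrades to roughly
	// one event per second instead of deadlocking outright.
	fmt.Printf("sent=%v after %v\n", ok, time.Since(start).Round(time.Millisecond))
}
```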

@jpkrohling
Member

That's helpful, thank you! Are you experiencing this in a particular scenario? As a workaround, can you increase the buffer size?

@vijayaggarwal
Author

@jpkrohling My configuration does make me more susceptible. Particularly,

  1. num_traces=100_000 (default is 1_000_000). Assuming a single trace to be 10 KB, 1_000_000 traces would consume 10 GB memory. So, I reduce to 100_000 to keep max memory consumption at ~1 GB.
  2. wait_duration=10s (default is 1s), as I have cron jobs that take a few seconds to execute.

This significantly increases the chances of the buffer getting full. That said, even with this configuration, I hit the deadlock only infrequently (about once a week). For now, I have configured an alert on the num_events_in_queue metric exposed by this component and simply restart the collector when the alert fires.

@jpkrohling
Member

@vijayaggarwal , I appreciate your feedback, that's very helpful, thank you!

@jpkrohling jpkrohling self-assigned this Jul 11, 2024