Skip to content

Conversation

nick1udwig
Copy link
Member

Problem

Nodes doing heavy work could sometimes freeze. We chased this down to be associated with the event loop queue getting filled & then deadlocking when processing a message that put >1 messages on the event loop.

https://discord.com/channels/1186394868336038080/1186684706700410962/1278437204749844611

Solution

For now, simply increase the queue size. This is a bandaid.

@dr-frmr recommended using an unbounded queue. The reasons I chose to instead increase the queue size & panic, outputting an error message if the queue is full are:

  1. Queues don't fix overload; the long-term correct solution here is not to increase the queue, whether bounded or not: its to add backpressure
  2. The backpressure solution likely looks more like the code in this branch, but with the panic!() replaced by, e.g., the backpressure mechanism (likely a new SendError variant)
  3. Pushing queue size to 100k buys us time to figure out the "correct" solution, while making the highest load case we've seen so far (providerfren.os) still functional
  4. This change is much smaller than the unbounded queue change, because that change implies changing types throughout to UnboundedSender. I wouldn't be opposed to this if I thought that was the "correct" solution, but since I now think the "correct" solution involves backpressure once the queue is full, we'd be changing types in order to change them back once we want to implement the "correct" solution

Testing

Run providerfren.os with develop vs this branch

Docs Update

None

Notes

None

@nick1udwig nick1udwig requested a review from dr-frmr August 29, 2024 00:50
@nick1udwig
Copy link
Member Author

Well shoot. Witnessed another providerfren.os freeze using this binary and without the expected panic!() print. So either I made a mistake setting it up with the new binary or this PR doesn't fix the issue.

@nick1udwig
Copy link
Member Author

Going to rebuild binary & run providerfren.os using it to eliminate that possibility. Hopefully we can witness another crash by tomorrow morning to confirm this doesn't work -- or not and gain confidence I just made a mistake wrt binary.

@nick1udwig
Copy link
Member Author

Yeah, still freezing. Darn.

@nick1udwig nick1udwig requested review from barraguda and removed request for dr-frmr August 30, 2024 07:50
@nick1udwig nick1udwig merged commit e6d8c01 into develop Aug 30, 2024
@nick1udwig nick1udwig deleted the hf/increase-event-loop-queue-size branch August 30, 2024 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant