Reduce blocking in main Naemon event loop #4

Merged
merged 2 commits into from Oct 13, 2016

Projects

None yet

4 participants

@jjneely
Contributor
jjneely commented Sep 19, 2016

With the Merlin network IO now being completely handled inside the
Naemon event loop, its possible under high numbers of checks and TCP
congestion / full buffers that we could end up poll()'ing and blocking
for 100ms for each check Naemon processes. This starves the even loop
for resources, Naemon cannot process check results, and we are not
reading from the Merlin sockets. This then cascades to peers as they
begin to block on a full TCP write buffer.

I've reduced the amount of time poll() will block for from 100ms to
10ms. I wonder if we should not block at all in the event loop.

jjneely added some commits Sep 19, 2016
@jjneely jjneely Reduce blocking in main Naemon event loop
With the Merlin network IO now being completely handled inside the
Naemon event loop, its possible under high numbers of checks and TCP
congestion / full buffers that we could end up poll()'ing and blocking
for 100ms for each check Naemon processes.  This starves the even loop
for resources, Naemon cannot process check results, and we are not
reading from the Merlin sockets.  This then cascades to peers as they
being to block on a full TCP write buffer.

I've reduced the amount of time poll() will block for from 100ms to
10ms.  I wonder if we should not block at all in the event loop.
2d0ef4f
@jjneely jjneely Do not block in the Naemon event loop
Instead of reducing blocking in the event loop, do not block at all.
Even with 10ms blocking there are still cases of starvation of the event
loop and inability to process results.
a5bdae1
@jjneely
Contributor
jjneely commented Sep 21, 2016

Updated this PR so that we don't block at all in the Naemon event loop. Even with 10ms I still experienced some small recurrences of event loop starvation / unable to process all fds fast enough.

@pengi
Contributor
pengi commented Oct 13, 2016

To have a poll of 0 might have side effects, but I can see that's how it works in the ipc case. Thus I'll pull this locally for a while to test it out before merging.

Great find however!

@op5-ci-user op5-ci-user merged commit a5bdae1 into op5:master Oct 13, 2016
@mfalkvidd

This fix will be included in tomorrow's release, see https://kb.op5.com/x/34IXAQ for details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment