While investigating https://bugzilla.redhat.com/show_bug.cgi?id=2172624 and searching for information/errors in the logs of all NooBaa's pods, we saw many RPC request timeouts:
a. Operator logs - full of pings from the operator to the core that never get an answer.
b. Core logs - all heartbeat() calls throw RPC request timeouts.
c. Endpoint logs - NooBaa receives new S3 requests but never returns a response.
After checking the code we realized that before processing an RPC request, NooBaa first runs a middleware that calls refresh(), which may call load() to reload the system store data.
The load() code is surrounded by the _serial_load semaphore (its counter is 1, so it effectively acts as a mutex).
We ran a Node debugger and saw that this semaphore is held and never released, with thousands of items in its waiting queue.
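Here is a minimal sketch of that locking pattern, with simplified, hypothetical names (NooBaa's real semaphore utility is more elaborate): a counter-1 semaphore behaves as a mutex, and load() runs its whole body inside it.

```js
// Minimal sketch, simplified and hypothetical; it only illustrates the
// locking pattern described above, not NooBaa's actual implementation.
class Semaphore {
    constructor(count) {
        this._count = count;   // initialized to 1 => effectively a mutex
        this._waiting = [];    // queued waiters (thousands deep during the hang)
    }
    async surround(func) {
        if (this._count > 0) {
            this._count -= 1;  // free slot: take it immediately
        } else {
            // no free slot: queue up until the current holder releases
            await new Promise(resolve => this._waiting.push(resolve));
        }
        try {
            return await func();
        } finally {
            const next = this._waiting.shift();
            if (next) next();       // hand the slot straight to the next waiter
            else this._count += 1;  // or release it
        }
    }
}

const _serial_load = new Semaphore(1);
let last_load_time = 0;

async function load() {
    // every reload of the system store data is serialized under the lock
    return _serial_load.surround(async () => {
        // ... fetch collections from the DB, rebuild in-memory indexes ...
        last_load_time = Date.now();
    });
}
```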
The 2 points in NooBaa's code that lock this semaphore are:
1. make_changes()
2. load()
While reading the make_changes() code again, we noticed that make_changes_internal() calls refresh(). In a corner case** of refresh(), NooBaa calls load(), which reloads the system store data under the same lock - but that lock is already held by make_changes(), and that's the deadlock (see the sketch after the footnote below). For more info about this change see #6066.
** The corner cases of refresh() that call load():
- load() is called and awaited when the system store data was last loaded more than 1 hour ago.
- load() is called without awaiting when the last load was between 10 and 60 minutes ago.
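Putting the pieces together, here is a sketch of the deadlock path, continuing the hypothetical code above (the threshold values are illustrative):

```js
const START_REFRESH_THRESHOLD = 10 * 60 * 1000; // 10 minutes (illustrative)
const FORCE_REFRESH_THRESHOLD = 60 * 60 * 1000; // 60 minutes (illustrative)

async function refresh() {
    const age = Date.now() - last_load_time;
    if (age > FORCE_REFRESH_THRESHOLD) {
        await load();        // older than 1 hour: reload and wait for it
    } else if (age > START_REFRESH_THRESHOLD) {
        load();              // 10-60 minutes old: reload in the background
    }                        // fresh enough: do nothing
}

async function make_changes(changes) {
    // make_changes() takes the same _serial_load lock that load() takes...
    return _serial_load.surround(async () => {
        // ...and (via make_changes_internal()) calls refresh(). When the
        // data is stale enough, refresh() awaits load(), which tries to
        // acquire _serial_load again. The semaphore is not re-entrant, so
        // this await never resolves: the lock is never released, and every
        // later caller piles up in the waiting queue.
        await refresh();
        // ... apply `changes` to the system store ...
    });
}
```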
Actual behavior
The S3 service receives new requests but never responds; RPC request timeouts appear all over the logs of NooBaa's pods.
Expected behavior
The S3 service receives new requests and replies to them; no RPC timeouts.
Steps to reproduce
Not easy to reproduce on a real system - we saw this issue only rarely.
Artificially setting START_REFRESH_THRESHOLD and FORCE_REFRESH_THRESHOLD to 0 should make the issue reappear, as illustrated below.
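In terms of the sketch above, zeroing both thresholds forces every refresh() into the awaited load() branch, so the very first make_changes() call deadlocks instead of only hitting the rare stale-data window:

```js
// Redefine the illustrative thresholds from the sketch above:
const START_REFRESH_THRESHOLD = 0; // normally ~10 minutes
const FORCE_REFRESH_THRESHOLD = 0; // normally ~60 minutes
// Now `Date.now() - last_load_time > FORCE_REFRESH_THRESHOLD` holds on
// every request, refresh() always awaits load(), and make_changes()
// deadlocks on its first call.
```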
More information - Screenshots / Logs / Other output
Some thoughts about solutions:
- Taking out the inner lock (i.e., not re-acquiring _serial_load from inside make_changes()) is the immediate solution.
- Adding a timeout to the semaphore acquisition can be considered as well, so that a stuck holder surfaces as an error instead of an ever-growing waiting queue.
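A sketch of both ideas, again with hypothetical names (load_unlocked() and surround_with_timeout() do not exist in NooBaa; they just illustrate the direction):

```js
// 1) Immediate fix: never call the locking load() while already holding
//    _serial_load; run the reload body directly instead. load_unlocked()
//    stands for load()'s body without the surrounding lock (hypothetical).
async function make_changes_fixed(changes) {
    return _serial_load.surround(async () => {
        if (Date.now() - last_load_time > FORCE_REFRESH_THRESHOLD) {
            await load_unlocked();
        }
        // ... apply `changes` to the system store ...
    });
}

// 2) Defensive measure: bound how long a caller waits. Note this races
//    against acquire + func together, and a timed-out waiter stays in the
//    queue; a production fix would need to handle both of those points.
async function surround_with_timeout(sem, timeout_ms, func) {
    let timer;
    const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(
            () => reject(new Error(`semaphore acquire timed out after ${timeout_ms}ms`)),
            timeout_ms);
    });
    try {
        return await Promise.race([sem.surround(func), timeout]);
    } finally {
        clearTimeout(timer);
    }
}
```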