RPC requests timeouts due to _serial_load semaphore deadlock #7220

romayalon · 2023-03-02T09:17:15Z

Environment info

NooBaa Version: 4.13
Platform: OpenShift 4.13

While investigating https://bugzilla.redhat.com/show_bug.cgi?id=2172624 and searching the logs of all NooBaa pods for errors, we saw many RPC request timeouts:
a. Operator logs - full of pings from the operator to core that go unanswered.
b. Core logs - all heartbeat() calls throw RPC request timeouts.
c. Endpoint logs - NooBaa receives new S3 requests but doesn't return a response.

After checking the code we realized that before processing an RPC request, NooBaa first runs a middleware that calls refresh(), which calls load() in order to reload the system store data.
The load() code is surrounded by the _serial_load semaphore (the semaphore count is 1).
We attached a Node debugger and saw that this semaphore was taken and never released, with thousands of items in its waiting queue.

The 2 points in NooBaa's code that lock this semaphore are:

make_changes()
load()

While rereading the make_changes() code, we noticed that make_changes_internal() calls refresh(). In a corner case** of refresh(), NooBaa calls load(), which loads the system store data under the same lock. That lock is already held by make_changes(), and that's the deadlock. For more info about this change see #6066.

** Corner cases of refresh() that call load():

load() is called and awaited when the system store data was last loaded more than 1 hour ago
load() is called without awaiting when the system store data was last loaded between 10 and 60 minutes ago
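The deadlock described above can be reproduced in isolation. The following is a minimal sketch, not NooBaa's code: `TinySemaphore` is a hypothetical stand-in for the real semaphore class, the function bodies are simplified, and it assumes START_REFRESH_THRESHOLD is the 10 minute bound and FORCE_REFRESH_THRESHOLD is the 1 hour bound. It shows the nested load() queuing behind the lock that make_changes() already holds, and later refresh() calls (standing in for the RPC middleware) piling up in the waiting queue.

```javascript
'use strict';

// Hypothetical minimal semaphore, NOT NooBaa's actual Semaphore class.
class TinySemaphore {
    constructor(count) {
        this.value = count;
        this.waiting = []; // resolvers of callers blocked on the lock
    }

    async surround(fn) {
        if (this.value > 0) {
            this.value -= 1;
        } else {
            // Block until a holder hands the permit over in release().
            await new Promise(resolve => this.waiting.push(resolve));
        }
        try {
            return await fn();
        } finally {
            this.release();
        }
    }

    release() {
        const next = this.waiting.shift();
        if (next) next(); // the permit passes directly to the next waiter
        else this.value += 1;
    }
}

// Assumed mapping of the two constants to the bounds given in the issue.
const START_REFRESH_THRESHOLD = 10 * 60 * 1000;
const FORCE_REFRESH_THRESHOLD = 60 * 60 * 1000;

const serial_load = new TinySemaphore(1);
let last_load_time = 0; // epoch 0, so the data looks older than both thresholds

async function load() {
    return serial_load.surround(async () => {
        last_load_time = Date.now();
    });
}

async function refresh() {
    const age = Date.now() - last_load_time;
    if (age > FORCE_REFRESH_THRESHOLD) {
        await load(); // awaited: the caller blocks until load() finishes
    } else if (age > START_REFRESH_THRESHOLD) {
        load().catch(() => undefined); // fired without awaiting
    }
}

async function make_changes() {
    return serial_load.surround(async () => {
        // In the issue this call happens inside make_changes_internal().
        await refresh();
    });
}

// make_changes() takes the only permit, then waits for a load() that needs it.
make_changes();

// Every later request runs refresh() first and queues behind the stuck lock.
for (let i = 0; i < 1000; i++) refresh();

console.log(serial_load.value, serial_load.waiting.length); // prints: 0 1001
```

None of the pending promises ever settle, which matches the symptom of requests being accepted but never answered.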

Actual behavior

The S3 service receives new requests without responding, and RPC request timeouts appear all over the NooBaa pods' logs.

Expected behavior

The S3 service receives new requests and replies, with no RPC timeouts.

Steps to reproduce

Not easy to reproduce on a real system; we have rarely seen this issue.

Artificially setting START_REFRESH_THRESHOLD and FORCE_REFRESH_THRESHOLD to 0 should reproduce the issue.

More information - Screenshots / Logs / Other output

Some thoughts about solutions:

Taking out the inner lock is the immediate solution.
Adding a timeout for this semaphore can be considered as well.
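Both ideas can be sketched in a few lines. This is an illustration under assumptions, not the change merged for this issue: `TimedSemaphore`, its 50 ms timeout, the `SEMAPHORE_TIMEOUT` error name, and the simplified refresh() that always awaits load() are all hypothetical. Option 1 calls refresh() before taking the lock so load() never nests inside it; option 2 keeps the nested call but lets a wait timeout turn the hang into an error and free the lock.

```javascript
'use strict';

// Hypothetical semaphore with a wait timeout, NOT NooBaa's Semaphore class.
class TimedSemaphore {
    constructor(count, timeout_ms) {
        this.value = count;
        this.timeout_ms = timeout_ms;
        this.waiting = [];
    }

    async surround(fn) {
        if (this.value > 0) this.value -= 1;
        else await this._wait(); // may reject before the permit is ever held
        try {
            return await fn();
        } finally {
            this._release();
        }
    }

    _wait() {
        return new Promise((resolve, reject) => {
            const waiter = () => {
                clearTimeout(timer);
                resolve();
            };
            const timer = setTimeout(() => {
                // Give up: leave the queue and fail instead of hanging forever.
                const index = this.waiting.indexOf(waiter);
                if (index >= 0) this.waiting.splice(index, 1);
                reject(new Error('SEMAPHORE_TIMEOUT'));
            }, this.timeout_ms);
            this.waiting.push(waiter);
        });
    }

    _release() {
        const next = this.waiting.shift();
        if (next) next();
        else this.value += 1;
    }
}

const serial_load = new TimedSemaphore(1, 50);
const events = [];

async function load() {
    return serial_load.surround(async () => {
        events.push('load');
    });
}

// Stand-in for refresh() in its forced branch: always awaits load().
async function refresh() {
    await load();
}

// Option 1: refresh before taking the lock, so load() never runs under it.
async function make_changes_fixed() {
    await refresh();
    return serial_load.surround(async () => {
        events.push('changes');
    });
}

// Option 2: keep the nested call; the timeout converts the deadlock into an error.
async function make_changes_nested() {
    return serial_load.surround(async () => {
        await refresh();
        events.push('changes');
    });
}

async function demo() {
    await make_changes_fixed();
    const err = await make_changes_nested().catch(e => e);
    events.push(err.message);
    return events;
}

const demo_done = demo();
demo_done.then(result => console.log(result.join(', ')));
// prints: load, changes, SEMAPHORE_TIMEOUT
```

With the timeout alone, the nested make_changes() still fails, but the lock is released afterwards, so later requests are not stuck behind it. Removing the nesting avoids the failure altogether.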

baum · 2023-03-02T09:23:14Z

PR #811 might be relevant 🖖

guymguym · 2023-03-02T10:22:32Z

@baum this problem is unrelated to the operator

romayalon mentioned this issue Mar 2, 2023

System Store | Move refresh() outside of make_changes_internal() #7221

Merged

romayalon closed this as completed in #7221 Mar 6, 2023
