Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC requests timeouts due to _serial_load semaphore deadlock #7220

Closed
romayalon opened this issue Mar 2, 2023 · 2 comments · Fixed by #7221
Closed

RPC requests timeouts due to _serial_load semaphore deadlock #7220

romayalon opened this issue Mar 2, 2023 · 2 comments · Fixed by #7221

Comments

@romayalon
Copy link
Contributor

Environment info

  • NooBaa Version: 4.13
  • Platform: OpenShift 4.13

While investigating https://bugzilla.redhat.com/show_bug.cgi?id=2172624 and searching for information/errors in the logs of all NooBaa's pods, we saw many RPC requests timeouts:
a. Operator logs - full of pings from operator to core without answers.
b. Core logs - all heartbeat() calls throws RPC requests timeout
c. Endpoint logs - NooBaa receives new s3 requests but doesn't return an answer.

After checking the code we realized that before processing RPC request, NooBaa first runs middleware that calls refresh() that calls load() in order to reload system store data.
load() code is surrounded by _serial_load semaphore locking (semaphore counter is 1).
We ran a node debugger and saw that this semaphore lock is taken and not being released and it has thousands of items in its waiting queue.

The 2 points in NooBaa's code that lock this semaphore are:

  • make_changes()
  • load()

While reading make_changes() code again, we noticed that inside make_changes_internal() we call refresh(). In a corner case** of refresh(), NooBaa calls load() which loads the system store data under the same lock, but this lock is already acquired by make_changes() - and that's the deadlock, for more info about this change see #6066

** corner cases of refresh that call load -

  1. load() is called and awaited when the system store data last load time > 1 hour ago
  2. load() is called without awaiting when the system store data last load time > 10 minutes and < 60 minutes

Actual behavior

  1. S3 service receives new requests without responding back, RPC requests timeouts all over noobaa's pods logs.

Expected behavior

  1. S3 service receives new requests and replies back, No RPC timeouts.

Steps to reproduce

Not easy to reproduce on a real system, we barely saw this issue.

  1. Change artificially START_REFRESH_THRESHOLD & FORCE_REFRESH_THRESHOLD in order to 0 and we should see this issue again.

More information - Screenshots / Logs / Other output

Some thoughts about solutions:

  1. Taking out the inner lock is the immediate solution.
  2. Adding a timeout for this semaphore can be considered as well.
@baum
Copy link
Contributor

baum commented Mar 2, 2023

PR #811 might be relevant 🖖

@guymguym
Copy link
Member

guymguym commented Mar 2, 2023

@baum this problem is unrelated to the operator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants