
COPY OBJECT | OBJECT IO | Semaphore deadlock fix #7443

Merged
1 commit merged into noobaa:master on Aug 21, 2023

Conversation

romayalon
Contributor

Explain the changes

The problem -
A BZ was opened reporting a replication QE test failure. During the BZ investigation, two problems surfaced:

  1. A deadlock was noticed in the object IO io_buffer_semaphore.
  2. A non-server-side copy acquires the semaphore twice (once in read_object_stream() and once in upload_stream()), i.e. double the size of the to-be-copied object, so a large copy can never complete with the default 500M endpoint memory.

Some more details -
The endpoint pod memory was 500M, and replication of large objects (~300MB and up) failed due to semaphore timeouts.
The first semaphore timeout was thrown from read_object_stream(), but since we did not await that promise, the rejection went unhandled; the upload_stream() part of the copy then got stuck and never released the semaphore (illustrated in the sketch below).
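For readers less familiar with the flow, here is a minimal, self-contained sketch of the failure pattern. ToySemaphore is a stand-in for the real buffer semaphore (the actual NooBaa surround_count() implementation differs); only the shape of the double acquisition and the un-awaited read-side promise matter here.

```js
// Toy stand-in for the io buffer semaphore - NOT the NooBaa implementation.
class ToySemaphore {
    constructor(capacity) {
        this._free = capacity;
        this._waiters = [];
    }

    // Run `work()` while holding `count` units.
    async surround_count(count, work) {
        while (count > this._free) {
            await new Promise(resolve => this._waiters.push(resolve));
        }
        this._free -= count;
        try {
            return await work();
        } finally {
            this._free += count;
            const waiters = this._waiters;
            this._waiters = [];
            for (const resolve of waiters) resolve(); // woken waiters re-check the condition
        }
    }
}

// Pre-fix copy shape: the read side holds `size` units until the upload drains
// the source, while the upload side asks for another `size` units up front.
async function copy_object(sem, size) {
    let finish_read;
    const source_drained = new Promise(resolve => { finish_read = resolve; });

    // Read side: holds `size` units for the whole copy. The promise is not awaited,
    // so if surround_count() ever times out, the rejection is unhandled and the
    // upload side is never told to abort (the reported bug).
    sem.surround_count(size, () => source_drained);

    // Upload side: needs another `size` units. With capacity 500 and size 300,
    // 300 (held) + 300 (requested) > 500, so this waits forever -> deadlock.
    await sem.surround_count(size, async () => { /* upload_stream() would drain here */ });
    finish_read();
}

copy_object(new ToySemaphore(500), 300).then(() => console.log('copy done')); // never prints
```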

The fix -

  1. The deadlock fix - chain a .catch() to the surround_count() call and emit the error on params.source_stream, so the copy aborts instead of hanging (see the sketch after this list).
  2. The copy double-buffer fix - upload_copy() passes an is_copy=true parameter to upload_stream(), which then calls surround_count() with size 0 instead of the object size, since that size is already acquired by read_object_stream().
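A hedged sketch of the two changes. The names (io_buffer_semaphore, read_object_stream, write_chunks, params.source_stream, is_copy) follow the PR description rather than the exact NooBaa signatures, so treat this as the shape of the fix, not the committed code.

```js
// 1. Deadlock fix: chain a catch to the read side's surround_count() so a
//    semaphore timeout cannot become an unhandled rejection, and surface the
//    error on the source stream so the copy aborts instead of hanging.
function start_read_side(io_buffer_semaphore, params, read_object_stream) {
    return io_buffer_semaphore
        .surround_count(params.size, () => read_object_stream(params))
        .catch(err => params.source_stream.emit('error', err));
}

// 2. Double-buffer fix: upload_copy() passes is_copy=true, and upload_stream()
//    then surrounds a count of 0 - the object's buffers were already accounted
//    for on the read side, so a copy needs 1x the object size, not 2x.
async function upload_stream(io_buffer_semaphore, params, write_chunks, is_copy = false) {
    const count = is_copy ? 0 : params.size;
    return io_buffer_semaphore.surround_count(count, () => write_chunks(params));
}
```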

Additional Suggestions -

  1. Add a work timeout to io_buffer_semaphore in order to break future deadlocks in other flows (see the sketch after this list).
  2. Use a centralized buffer pool semaphore for the entire pod, since we have similar semaphores in namespace_fs etc.
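A rough sketch of suggestion 1, assuming a semaphore-like object that exposes surround_count(count, work); the wrapper name and timeout handling are made up for illustration. Note that timing out the surround releases the units but does not by itself stop the underlying work, so real code would also need to abort it.

```js
// Reject (and release the semaphore units) if the surrounded work runs too long.
function with_timeout(promise, ms, message) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(message)), ms);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Wrap an existing semaphore-like object so every surround has a work timeout.
function add_work_timeout(semaphore, work_timeout_ms) {
    return {
        surround_count(count, work) {
            return semaphore.surround_count(count, () =>
                with_timeout(work(), work_timeout_ms,
                    `buffer semaphore work timed out after ${work_timeout_ms}ms`));
        },
    };
}
```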

Issues:

  1. Fixed: https://bugzilla.redhat.com/show_bug.cgi?id=2229124

Testing Instructions:

  1. Create a backingstore bs1 on AWS
  2. Create a backingstore bs2 on AWS
  3. Create a placement bucketclass bc1 on top of bs1
  4. Create a placement bucketclass bc2 on top of bs2
  5. Create an OBC obc1 on top of bc1
  6. Create an OBC obc2 on top of bc2 with a replication policy that points to obc1: {"rules":[{"rule_id":"rule1","destination_bucket":"obc1"}]}
  7. Upload an object of at least 300MB to obc2.
  8. Check that the object was replicated to obc1 (steps 7-8 are sketched below the checklist).
  • Doc added/updated
  • Tests added
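For steps 7-8, a hedged sketch using the AWS SDK for JavaScript (v2) against the NooBaa S3 endpoint; the endpoint URL, credentials, and bucket names are placeholders you would take from the OBC secrets/config maps, and the polling interval is arbitrary.

```js
const AWS = require('aws-sdk');
const crypto = require('crypto');

// Placeholders - take these from the obc2 secret / config map in the cluster.
const s3 = new AWS.S3({
    endpoint: 'https://<noobaa-s3-endpoint>',
    accessKeyId: '<obc2-access-key>',
    secretAccessKey: '<obc2-secret-key>',
    s3ForcePathStyle: true,
    signatureVersion: 'v4',
});

async function main() {
    // Step 7: upload a >= 300MB object to obc2's bucket.
    const body = crypto.randomBytes(300 * 1024 * 1024);
    await s3.upload({ Bucket: '<obc2-bucket-name>', Key: 'big_object', Body: body }).promise();

    // Step 8: poll obc1's bucket until the replicated object shows up
    // (a real test would use obc1's own credentials and a bounded retry count).
    for (;;) {
        try {
            await s3.headObject({ Bucket: '<obc1-bucket-name>', Key: 'big_object' }).promise();
            console.log('object replicated to obc1');
            return;
        } catch (err) {
            if (err.code !== 'NotFound') throw err;
            await new Promise(resolve => setTimeout(resolve, 30 * 1000));
        }
    }
}

main().catch(err => {
    console.error(err);
    process.exit(1);
});
```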


@liranmauda (Contributor) left a comment:


LGTM

Signed-off-by: Romy <35330373+romayalon@users.noreply.github.com>
@romayalon merged commit d27e4f7 into noobaa:master on Aug 21, 2023
7 checks passed