
COPY OBJECT | OBJECT IO | Semaphore deadlock fix #7443

Merged
1 commit merged into noobaa:master on Aug 21, 2023

Conversation

romayalon
Contributor

Explain the changes

The problem -
A BZ was opened reporting a replication QE test failure. During the BZ investigation, two problems surfaced:

  1. A deadlock was noticed in the object IO io_buffer_semaphore.
  2. A non-server-side copy acquires the semaphore twice (once in read_object_stream() and once in upload_stream()), i.e. double the size of the to-be-copied object, so a large copy can never complete with the default 500M endpoint memory.

Some more details -
The endpoint pod memory was 500M, and replication of large objects (~300MB and up) failed due to semaphore timeouts.
The first semaphore timeout was thrown from read_object_stream(), but since we did not await that promise, the rejection went unhandled; the upload_stream() part of the copy then got stuck and never released the semaphore (illustrated in the sketch below).
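For readers less familiar with the flow, here is a minimal, self-contained sketch of the failure pattern. ToySemaphore is a stand-in for the real buffer semaphore (the actual NooBaa surround_count() implementation differs); only the shape of the double acquisition and the un-awaited read-side promise matter here.

```js
// Toy stand-in for the io buffer semaphore - NOT the NooBaa implementation.
class ToySemaphore {
    constructor(capacity) {
        this._free = capacity;
        this._waiters = [];
    }

    // Run `work()` while holding `count` units.
    async surround_count(count, work) {
        while (count > this._free) {
            await new Promise(resolve => this._waiters.push(resolve));
        }
        this._free -= count;
        try {
            return await work();
        } finally {
            this._free += count;
            const waiters = this._waiters;
            this._waiters = [];
            for (const resolve of waiters) resolve(); // woken waiters re-check the condition
        }
    }
}

// Pre-fix copy shape: the read side holds `size` units until the upload drains
// the source, while the upload side asks for another `size` units up front.
async function copy_object(sem, size) {
    let finish_read;
    const source_drained = new Promise(resolve => { finish_read = resolve; });

    // Read side: holds `size` units for the whole copy. The promise is not awaited,
    // so if surround_count() ever times out, the rejection is unhandled and the
    // upload side is never told to abort (the reported bug).
    sem.surround_count(size, () => source_drained);

    // Upload side: needs another `size` units. With capacity 500 and size 300,
    // 300 (held) + 300 (requested) > 500, so this waits forever -> deadlock.
    await sem.surround_count(size, async () => { /* upload_stream() would drain here */ });
    finish_read();
}

copy_object(new ToySemaphore(500), 300).then(() => console.log('copy done')); // never prints
```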

The fix -

  1. The deadlock fix - chain a .catch() to the surround_count() call and emit the error on params.source_stream, so the copy aborts instead of hanging (see the sketch after this list).
  2. The copy double-buffer fix - upload_copy() passes an is_copy=true parameter to upload_stream(), which then calls surround_count() with size 0 instead of the object size, since that size is already acquired by read_object_stream().
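A hedged sketch of the two changes. The names (io_buffer_semaphore, read_object_stream, write_chunks, params.source_stream, is_copy) follow the PR description rather than the exact NooBaa signatures, so treat this as the shape of the fix, not the committed code.

```js
// 1. Deadlock fix: chain a catch to the read side's surround_count() so a
//    semaphore timeout cannot become an unhandled rejection, and surface the
//    error on the source stream so the copy aborts instead of hanging.
function start_read_side(io_buffer_semaphore, params, read_object_stream) {
    return io_buffer_semaphore
        .surround_count(params.size, () => read_object_stream(params))
        .catch(err => params.source_stream.emit('error', err));
}

// 2. Double-buffer fix: upload_copy() passes is_copy=true, and upload_stream()
//    then surrounds a count of 0 - the object's buffers were already accounted
//    for on the read side, so a copy needs 1x the object size, not 2x.
async function upload_stream(io_buffer_semaphore, params, write_chunks, is_copy = false) {
    const count = is_copy ? 0 : params.size;
    return io_buffer_semaphore.surround_count(count, () => write_chunks(params));
}
```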

Additional Suggestions -

  1. Add a work timeout to io_buffer_semaphore in order to break future deadlocks in other flows (see the sketch after this list).
  2. Use a centralized buffer pool semaphore for the entire pod, since we have similar semaphores in namespace_fs etc.
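A rough sketch of suggestion 1, assuming a semaphore-like object that exposes surround_count(count, work); the wrapper name and timeout handling are made up for illustration. Note that timing out the surround releases the units but does not by itself stop the underlying work, so real code would also need to abort it.

```js
// Reject (and release the semaphore units) if the surrounded work runs too long.
function with_timeout(promise, ms, message) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(message)), ms);
    });
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Wrap an existing semaphore-like object so every surround has a work timeout.
function add_work_timeout(semaphore, work_timeout_ms) {
    return {
        surround_count(count, work) {
            return semaphore.surround_count(count, () =>
                with_timeout(work(), work_timeout_ms,
                    `buffer semaphore work timed out after ${work_timeout_ms}ms`));
        },
    };
}
```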

Issues:

  1. Fixed: https://bugzilla.redhat.com/show_bug.cgi?id=2229124

Testing Instructions:

  1. Create a backingstore bs1 on AWS
  2. Create a backingstore bs2 on AWS
  3. Create a placement bucketclass bc1 on top of bs1
  4. Create a placement bucketclass bc2 on top of bs2
  5. Create an OBC obc1 on top of bc1
  6. Create an OBC obc2 on top of bc2 with a replication policy that points to obc1: {"rules":[{"rule_id":"rule1","destination_bucket":"obc1"}]}
  7. Upload an object of at least 300MB to obc2.
  8. Check that the object was replicated to obc1 (steps 7-8 are sketched below the checklist).
  • Doc added/updated
  • Tests added
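For steps 7-8, a hedged sketch using the AWS SDK for JavaScript (v2) against the NooBaa S3 endpoint; the endpoint URL, credentials, and bucket names are placeholders you would take from the OBC secrets/config maps, and the polling interval is arbitrary.

```js
const AWS = require('aws-sdk');
const crypto = require('crypto');

// Placeholders - take these from the obc2 secret / config map in the cluster.
const s3 = new AWS.S3({
    endpoint: 'https://<noobaa-s3-endpoint>',
    accessKeyId: '<obc2-access-key>',
    secretAccessKey: '<obc2-secret-key>',
    s3ForcePathStyle: true,
    signatureVersion: 'v4',
});

async function main() {
    // Step 7: upload a >= 300MB object to obc2's bucket.
    const body = crypto.randomBytes(300 * 1024 * 1024);
    await s3.upload({ Bucket: '<obc2-bucket-name>', Key: 'big_object', Body: body }).promise();

    // Step 8: poll obc1's bucket until the replicated object shows up
    // (a real test would use obc1's own credentials and a bounded retry count).
    for (;;) {
        try {
            await s3.headObject({ Bucket: '<obc1-bucket-name>', Key: 'big_object' }).promise();
            console.log('object replicated to obc1');
            return;
        } catch (err) {
            if (err.code !== 'NotFound') throw err;
            await new Promise(resolve => setTimeout(resolve, 30 * 1000));
        }
    }
}

main().catch(err => {
    console.error(err);
    process.exit(1);
});
```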


@liranmauda (Contributor) left a comment:


LGTM

Signed-off-by: Romy <35330373+romayalon@users.noreply.github.com>
@romayalon merged commit d27e4f7 into noobaa:master on Aug 21, 2023
7 checks passed