OperationProcessor queue may be left closed while the container is in RUNNING state #6021

Closed
RaulGracia opened this issue May 13, 2021 · 2 comments · Fixed by #6085 or #6220

Comments

@RaulGracia
Contributor

Describe the bug
Under a high load scenario, we sporadically face an issue in which:

  1. A Bookie restarts/disconnects due to high load, which leads the BookKeeper client to throw BKNotEnoughBookiesException, and then some Segment Containers restart.

  2. After that, the Segment Store seems to have all its containers recovered:

2021-05-12 15:02:12,342 6791252 [core-6] INFO  i.p.s.s.h.ZKSegmentContainerMonitor - Container Changes: Desired = [81, 18, 4, 85, 86, 39, 9, 61], Current = [81, 18, 4, 85, 86, 39, 9, 61], PendingTasks = [], ToStart = [], ToStop = [].
  3. However, some operations against a Container (the most active one) in that Segment Store are failing:
2021-05-12 14:59:10,680 6609590 [storage-io-102] WARN  i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[61] read - late op=1793443990, segment=_system/containers/storage_metadata_61$attributes.index, offset=306027643632, bytesRead=15734, latency=114.
io.pravega.common.ObjectClosedException: Object 'PriorityBlockingDrainingQueue' has been closed and cannot be accessed anymore.
...
2021-05-12 15:18:37,872 7776782 [epollEventLoopGroup-10-12] ERROR i.p.s.s.h.h.PravegaRequestProcessor - [requestId=75900249738051584] Error (Segment = 'flink-longevity-0/flink-longevity-0/39.#epoch.0', Operation = 'truncateSegment')
io.pravega.common.ObjectClosedException: Object 'PriorityBlockingDrainingQueue' has been closed and cannot be accessed anymore.
        at io.pravega.common.Exceptions.checkNotClosed(Exceptions.java:255)
...
  4. However, that Container still seems able to serve reads (around 2K reads per second while the previous exceptions were being thrown, according to the metrics).

Could it be a problem in the initialization/shutdown sequence that sometimes leaves the Container in RUNNING state while the OperationProcessor queue is closed?
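
For reference, below is a minimal sketch (plain Java, not Pravega's actual implementation; the class and method names are illustrative stand-ins) of the guard pattern that produces the errors above: once the underlying queue is flagged as closed, every call fails with an ObjectClosedException-style error, regardless of the state the container reports.

```java
// Minimal, hypothetical sketch of the symptom described above (not Pravega code).
public class CheckNotClosedSketch {
    // Hypothetical stand-in for io.pravega.common.ObjectClosedException.
    static class ObjectClosedException extends RuntimeException {
        ObjectClosedException(String objectName) {
            super("Object '" + objectName + "' has been closed and cannot be accessed anymore.");
        }
    }

    // Hypothetical stand-in for the Exceptions.checkNotClosed guard seen in the stack traces.
    static void checkNotClosed(boolean closed, String objectName) {
        if (closed) {
            throw new ObjectClosedException(objectName);
        }
    }

    public static void main(String[] args) {
        boolean queueClosed = true;        // the queue was closed by an error handler...
        String containerState = "RUNNING"; // ...but the container still reports RUNNING
        System.out.println("Container state: " + containerState);
        try {
            checkNotClosed(queueClosed, "PriorityBlockingDrainingQueue"); // same message as the logs
        } catch (ObjectClosedException e) {
            System.out.println("process() rejected: " + e.getMessage());
        }
    }
}
```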

To Reproduce
This requires a high-load scenario in which Bookies are close to saturation (in our case, SLTS was also enabled). If one or more Bookies are restarted or disconnected, the issue may show up.

Screenshots
n/a

Additional information
The problem is not persistent: once we manually restart the impacted Segment Store, everything goes back to normal.

@RaulGracia
Contributor Author

RaulGracia commented Jun 16, 2021

I think we now have a clearer picture of this problem. The sequence of events that leads to it is as follows:

  1. Due to a network glitch or any other reason, we see problems writing to BookKeeper and the BookKeeperLog gets closed:
2021-06-15 17:08:50,047 6025143 [core-4] WARN  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Too many rollover failures; closing.
...
2021-06-15 17:09:38,127 6073223 [core-4] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Unable to close LedgerHandle for Ledger 1448.
2021-06-15 17:09:38,127 6073223 [core-4] INFO  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Closed.
  2. This leads the BookKeeperLog to fail all the pending writes with CancellationException:
    this.writes.close().forEach(w -> w.fail(new CancellationException("BookKeeperLog has been closed."), true));
2021-06-15 17:09:38,123 6073219 [core-4] WARN  i.p.s.s.logs.OperationProcessor - OperationProcessor[248]: Cancelling 30697 operations with exception: java.util.concurrent.CancellationException: BookKeeperLog has been closed..
  3. These exceptions are handled by the DataFrameBuilder in its handleProcessingException method. In particular, as part of handling the exceptions it invokes a callback, which is basically the this.state::fail method in OperationProcessor:

    val args = new DataFrameBuilder.Args(this.state::frameSealed, this.state::commit, this.state::fail, this.executor);

  4. As part of executing this.state::fail, the OperationProcessor also executes the errorHandler method:

    Callbacks.invokeSafely(OperationProcessor.this::errorHandler, ex, null);

    which in turn closes the operationQueue via the closeQueue() method. This explains the continuous ObjectClosedException errors when invoking OperationProcessor.process().

  5. However, when the OperationProcessor gets the associated exceptions from the failed writes (CancellationException), it does not shut down, because this kind of exception is not considered fatal by the corresponding exception-handling logic.

Given that the OperationProcessor does not treat CancellationException as fatal, the processor keeps working and does not shut down, even though the operationQueue has been closed. This leaves the OperationProcessor in the inconsistent state we see in the logs.
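
To make the interaction concrete, here is a minimal, self-contained sketch (plain Java, hypothetical names, not the actual OperationProcessor code) of the sequence described above: the fail callback runs an error handler that always closes the queue, but only fatal exceptions shut the processor down, so a CancellationException leaves it RUNNING with a closed queue.

```java
// Hypothetical sketch of the failure sequence described in this comment (not Pravega code):
// fail() -> errorHandler() always closes the queue, but only fatal exceptions change state,
// so a CancellationException leaves the processor RUNNING while the queue rejects all work.
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentLinkedQueue;

public class NonFatalCloseDemo {
    enum State { RUNNING, FAILED }

    private volatile State state = State.RUNNING;
    private final ConcurrentLinkedQueue<String> operationQueue = new ConcurrentLinkedQueue<>();
    private volatile boolean queueClosed = false;

    // Analogous to the this.state::fail callback invoked when writes fail.
    void fail(Throwable cause) {
        errorHandler(cause);
    }

    // Analogous to the errorHandler step: closes the queue unconditionally, but only
    // transitions the processor to FAILED when the exception is considered fatal.
    private void errorHandler(Throwable cause) {
        closeQueue();
        if (isFatal(cause)) {
            state = State.FAILED; // never reached for CancellationException
        }
    }

    private void closeQueue() {
        queueClosed = true;
        operationQueue.clear();
    }

    // Hypothetical classification mirroring the behavior described above:
    // CancellationException is treated as non-fatal.
    private boolean isFatal(Throwable cause) {
        return !(cause instanceof CancellationException);
    }

    void process(String operation) {
        if (queueClosed) {
            throw new IllegalStateException(
                    "Object 'operationQueue' has been closed and cannot be accessed anymore.");
        }
        operationQueue.add(operation);
    }

    public static void main(String[] args) {
        NonFatalCloseDemo processor = new NonFatalCloseDemo();
        processor.fail(new CancellationException("BookKeeperLog has been closed."));
        System.out.println("Processor state after failure: " + processor.state); // still RUNNING
        try {
            processor.process("truncateSegment"); // rejected: the queue is already closed
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```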

@RaulGracia
Contributor Author

This problem has been detected again despite the previous PR, so I am reopening the issue.
