OperationProcessor queue may be left closed while the container is in RUNNING state #6021

Closed
RaulGracia opened this issue May 13, 2021 · 2 comments · Fixed by #6085 or #6220

Comments

@RaulGracia
Contributor

Describe the bug
Under a high load scenario, we sporadically face an issue in which:

  1. A Bookie restarts/disconnects due to high load, which leads the BookKeeper client to throw BKNotEnoughBookiesException, and then some Segment Containers restart.

  2. After that, the Segment Store seems to have all its containers recovered:

2021-05-12 15:02:12,342 6791252 [core-6] INFO  i.p.s.s.h.ZKSegmentContainerMonitor - Container Changes: Desired = [81, 18, 4, 85, 86, 39, 9, 61], Current = [81, 18, 4, 85, 86, 39, 9, 61], PendingTasks = [], ToStart = [], ToStop = [].
  3. However, some operations against a Container (the most active one) in that Segment Store are failing:
2021-05-12 14:59:10,680 6609590 [storage-io-102] WARN  i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[61] read - late op=1793443990, segment=_system/containers/storage_metadata_61$attributes.index, offset=306027643632, bytesRead=15734, latency=114.
io.pravega.common.ObjectClosedException: Object 'PriorityBlockingDrainingQueue' has been closed and cannot be accessed anymore.
...
2021-05-12 15:18:37,872 7776782 [epollEventLoopGroup-10-12] ERROR i.p.s.s.h.h.PravegaRequestProcessor - [requestId=75900249738051584] Error (Segment = 'flink-longevity-0/flink-longevity-0/39.#epoch.0', Operation = 'truncateSegment')
io.pravega.common.ObjectClosedException: Object 'PriorityBlockingDrainingQueue' has been closed and cannot be accessed anymore.
        at io.pravega.common.Exceptions.checkNotClosed(Exceptions.java:255)
...
  4. However, that Container still seems able to serve reads (around 2K reads per second while the previous exceptions were being thrown, according to the metrics).

Could it be a problem in the initialization/shutdown sequence that sometimes leaves the Container in RUNNING state while the OperationProcessor queue is closed?
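
For reference, below is a minimal sketch (plain Java, not Pravega's actual implementation; the class and method names are illustrative stand-ins) of the guard pattern that produces the errors above: once the underlying queue is flagged as closed, every call fails with an ObjectClosedException-style error, regardless of the state the container reports.

```java
// Minimal, hypothetical sketch of the symptom described above (not Pravega code).
public class CheckNotClosedSketch {
    // Hypothetical stand-in for io.pravega.common.ObjectClosedException.
    static class ObjectClosedException extends RuntimeException {
        ObjectClosedException(String objectName) {
            super("Object '" + objectName + "' has been closed and cannot be accessed anymore.");
        }
    }

    // Hypothetical stand-in for the Exceptions.checkNotClosed guard seen in the stack traces.
    static void checkNotClosed(boolean closed, String objectName) {
        if (closed) {
            throw new ObjectClosedException(objectName);
        }
    }

    public static void main(String[] args) {
        boolean queueClosed = true;        // the queue was closed by an error handler...
        String containerState = "RUNNING"; // ...but the container still reports RUNNING
        System.out.println("Container state: " + containerState);
        try {
            checkNotClosed(queueClosed, "PriorityBlockingDrainingQueue"); // same message as the logs
        } catch (ObjectClosedException e) {
            System.out.println("process() rejected: " + e.getMessage());
        }
    }
}
```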

To Reproduce
This requires a high-load scenario in which Bookies are close to saturation (in our case, SLTS was also enabled). If one or more Bookies are restarted or disconnected, the issue may show up.

Screenshots
n/a

Additional information
The problem is not persistent: once we manually restart the impacted Segment Store, everything goes back to normal.

@RaulGracia
Contributor Author

RaulGracia commented Jun 16, 2021

I think we now have a clearer picture of this problem. The sequence of events that leads to it is as follows:

  1. Due to a network glitch or any other reason, we see problems writing to BookKeeper and the BookKeeperLog gets closed:
2021-06-15 17:08:50,047 6025143 [core-4] WARN  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Too many rollover failures; closing.
...
2021-06-15 17:09:38,127 6073223 [core-4] ERROR i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Unable to close LedgerHandle for Ledger 1448.
2021-06-15 17:09:38,127 6073223 [core-4] INFO  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[248]: Closed.
  2. This leads the BookKeeperLog to fail all the pending writes with CancellationException:
    this.writes.close().forEach(w -> w.fail(new CancellationException("BookKeeperLog has been closed."), true));
2021-06-15 17:09:38,123 6073219 [core-4] WARN  i.p.s.s.logs.OperationProcessor - OperationProcessor[248]: Cancelling 30697 operations with exception: java.util.concurrent.CancellationException: BookKeeperLog has been closed..
  3. These exceptions are handled by the DataFrameBuilder in its handleProcessingException method. In particular, as part of handling the exceptions it invokes a callback, which is basically the this.state::fail method in OperationProcessor:

    val args = new DataFrameBuilder.Args(this.state::frameSealed, this.state::commit, this.state::fail, this.executor);

  4. As part of executing this.state::fail, the OperationProcessor also executes the errorHandler method:

    Callbacks.invokeSafely(OperationProcessor.this::errorHandler, ex, null);

    which in turn closes the operationQueue via the closeQueue() method. This explains the continuous ObjectClosedException errors when invoking OperationProcessor.process().

  5. However, when the OperationProcessor gets the associated exceptions from the failed writes (CancellationException), it does not shut down, because this kind of exception is not considered fatal by the corresponding exception-handling logic.

Given that the OperationProcessor does not treat CancellationException as fatal, the processor keeps working and does not shut down, even though the operationQueue has been closed. This leaves the OperationProcessor in the inconsistent state we see in the logs.
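
To make the interaction concrete, here is a minimal, self-contained sketch (plain Java, hypothetical names, not the actual OperationProcessor code) of the sequence described above: the fail callback runs an error handler that always closes the queue, but only fatal exceptions shut the processor down, so a CancellationException leaves it RUNNING with a closed queue.

```java
// Hypothetical sketch of the failure sequence described in this comment (not Pravega code):
// fail() -> errorHandler() always closes the queue, but only fatal exceptions change state,
// so a CancellationException leaves the processor RUNNING while the queue rejects all work.
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentLinkedQueue;

public class NonFatalCloseDemo {
    enum State { RUNNING, FAILED }

    private volatile State state = State.RUNNING;
    private final ConcurrentLinkedQueue<String> operationQueue = new ConcurrentLinkedQueue<>();
    private volatile boolean queueClosed = false;

    // Analogous to the this.state::fail callback invoked when writes fail.
    void fail(Throwable cause) {
        errorHandler(cause);
    }

    // Analogous to the errorHandler step: closes the queue unconditionally, but only
    // transitions the processor to FAILED when the exception is considered fatal.
    private void errorHandler(Throwable cause) {
        closeQueue();
        if (isFatal(cause)) {
            state = State.FAILED; // never reached for CancellationException
        }
    }

    private void closeQueue() {
        queueClosed = true;
        operationQueue.clear();
    }

    // Hypothetical classification mirroring the behavior described above:
    // CancellationException is treated as non-fatal.
    private boolean isFatal(Throwable cause) {
        return !(cause instanceof CancellationException);
    }

    void process(String operation) {
        if (queueClosed) {
            throw new IllegalStateException(
                    "Object 'operationQueue' has been closed and cannot be accessed anymore.");
        }
        operationQueue.add(operation);
    }

    public static void main(String[] args) {
        NonFatalCloseDemo processor = new NonFatalCloseDemo();
        processor.fail(new CancellationException("BookKeeperLog has been closed."));
        System.out.println("Processor state after failure: " + processor.state); // still RUNNING
        try {
            processor.process("truncateSegment"); // rejected: the queue is already closed
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```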

@RaulGracia
Contributor Author

This problem has been detected again despite the previous PR, so I am reopening the issue.
