Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SLTS - BaseMetadataStore.get does not return deep copy when loading from store. #5853

Closed
sachin-j-joshi opened this issue Mar 17, 2021 · 0 comments · Fixed by #5858
Closed
Assignees
Labels

Comments

@sachin-j-joshi
Copy link
Contributor

Describe the bug
SLTS design requires that each transaction gets it own deep copy of metadata objects so that changes to one copy in one transaction does not affect others.

During conversion from non-async code to async code the bug was introduced that made the deep copy of the wrong object.
This may result in using unintended changes from uncommitted failed transactions.

To Reproduce
The issue reproduces within few days of longevity run.

Expected behavior
BaseMetadataStore.get should always return deep copy to transaction.

Screenshots

Additional context
This issue was seen 3/4 times in longevity tests.
The issue is caused by the exception in table segment (java.util.ConcurrentModificationException) that inorrectly leaves data partially modified (instead of being atomic).
Here is a sequence of events
[1] SLTS recieves a request to append at 141873

2021-03-13 14:27:35,268 66198305 [storage-io-139] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - started op=1484422071, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873 length=1752.
2021-03-13 14:27:35,269 66198306 [storage-io-139] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - New chunk added - op=1484422071, segment=longevity/smallScaleTx/519.#epoch.535, chunk=longevity/smallScaleTx/519.#epoch.535.E-2-O-141873.36f41341-a087-4082-a515-4545486e41ff, offset=141873.
2021-03-13 14:27:35,282 66198319 [storage-io-139] DEBUG i.p.s.s.c.AsyncBaseChunkStorage - CreateWithContent - chunk=longevity/smallScaleTx/519.#epoch.535.E-2-O-141873.36f41341-a087-4082-a515-4545486e41ff, bytesWritten=1752, latency=12.

This should result in 2 changes

  • 1 KV pair being modified segment metadata to point to the new chunk
  • 1 KV pair for new chunk metadata for new chunk added.
    Both keys are updated in single tableStore.put() call with the expectation that change is atomic. Either both keys change at same time or none.
    [2] The transaction fails becuase of TableSegment error (edited)
12:03
2021-03-13 14:27:35,318 66198355 [storage-io-75] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - exception op=1484422071, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873, length=1752.
2021-03-13 14:27:35,621 66198658 [core-14] ERROR i.p.s.server.writer.StorageWriter - StorageWriter[2]: Iteration[303221].Error.
io.pravega.segmentstore.storage.metadata.StorageMetadataException: Transaction failed
        at io.pravega.segmentstore.storage.metadata.TableBasedMetadataStore.handleException(TableBasedMetadataStore.java:200)
        at io.pravega.segmentstore.storage.metadata.TableBasedMetadataStore.lambda$writeAll$8(TableBasedMetadataStore.java:188)
        at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
        at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
        at io.pravega.common.concurrent.Futures.lambda$exceptionListener$5(Futures.java:277)
        at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
        at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
        at io.pravega.segmentstore.server.logs.operations.CompletableOperation.fail(CompletableOperation.java:111)
        at io.pravega.segmentstore.server.logs.OperationProcessor$QueueProcessingState.failOperation(OperationProcessor.java:670)
        at io.pravega.segmentstore.server.logs.OperationProcessor.processOperations(OperationProcessor.java:316)
        at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.ConcurrentModificationException: null
        at java.base/java.util.HashMap$HashIterator.nextNode(Unknown Source)
        at java.base/java.util.HashMap$EntryIterator.next(Unknown Source)
        at java.base/java.util.HashMap$EntryIterator.next(Unknown Source)
        at io.pravega.common.util.CollectionHelpers$ConvertedIterator.next(CollectionHelpers.java:271)
        at com.google.common.collect.Iterators$ConcatenatedIterator.next(Iterators.java:1364)
        at com.google.common.collect.Iterators$1.next(Iterators.java:141)
        at java.base/java.util.HashMap.putMapEntries(Unknown Source)
        at java.base/java.util.HashMap.<init>(Unknown Source)
        at io.pravega.segmentstore.server.logs.SegmentMetadataUpdateTransaction.getAttributes(SegmentMetadataUpdateTransaction.java:148)
        at io.pravega.segmentstore.server.logs.SegmentMetadataUpdateTransaction.<init>(SegmentMetadataUpdateTransaction.java:108)
        at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.tryGetSegmentUpdateTransaction(ContainerMetadataUpdateTransaction.java:558)
        at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.getSegmentUpdateTransaction(ContainerMetadataUpdateTransaction.java:520)
        at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.preProcessOperation(ContainerMetadataUpdateTransaction.java:305)
        at io.pravega.segmentstore.server.logs.OperationMetadataUpdater.preProcessOperation(OperationMetadataUpdater.java:224)
        at io.pravega.segmentstore.server.logs.OperationProcessor.processOperation(OperationProcessor.java:387)
        at io.pravega.segmentstore.server.logs.OperationProcessor.processOperations(OperationProcessor.java:311)
        ... 8 common frames omitted

12:03
[3] The write is retried again but fails with exception that indicates that segment metadata is already modified as a result of last operation.

2021-03-13 14:27:37,116 66200153 [storage-io-52] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - started op=1853561620, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873 length=1752.
2021-03-13 14:27:37,644 66200681 [core-20] ERROR i.p.s.server.writer.StorageWriter - StorageWriter[2]: Iteration[303222].Error.
io.pravega.segmentstore.contracts.BadOffsetException: [Segment 'longevity/smallScaleTx/519.#epoch.535'] Bad Offset. Expected 143625, given 141873.
        at io.pravega.segmentstore.storage.chunklayer.WriteOperation.checkState(WriteOperation.java:304)
        at io.pravega.segmentstore.storage.chunklayer.WriteOperation.lambda$call$4(WriteOperation.java:101)
        at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

[4] However when the read request is received , the table segment again returns incorrect metadata.
ReadOperation based on metadata maps read request to last chunk. (Which means read index is not involved, becuase we already know it is the last chunk)
But the metadata for last chunk is missing because table segment never completed that transaction.

2021-03-13 14:27:39,124 66202161 [storage-io-178] DEBUG i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[1] read - chunk lookup - op=372890379, segment=_system/containers/storage_metadata_1, offse
t=1585435437, scanned=0, latency=2.
2021-03-13 14:27:39,125 66202162 [storage-io-178] DEBUG i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[2] read - exception op=1189683104, segment=longevity/smallScaleTx/519.#epoch.535, offset=14
1873, bytesRead=0.
java.lang.IllegalStateException: chunkToReadFrom is null. currentChunkName=longevity/smallScaleTx/519.#epoch.535.E-2-O-141873.36f41341-a087-4082-a515-4545486e41ff Segment=longevity/smallScaleTx/519.#epoch.535
        at com.google.common.base.Preconditions.checkState(Preconditions.java:824)
        at io.pravega.segmentstore.storage.chunklayer.ReadOperation.lambda$findChunkForOffset$15(ReadOperation.java:260)
        at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Simplified LTS
  
Awaiting triage
1 participant