Describe the bug
The SLTS design requires that each transaction gets its own deep copy of the metadata objects, so that changes made to one copy in one transaction do not affect others.
During the conversion from non-async to async code, a bug was introduced that made a deep copy of the wrong object.
This may result in unintended changes from uncommitted, failed transactions being used.
To Reproduce
The issue reproduces within a few days of a longevity run.
Expected behavior
BaseMetadataStore.get should always return a deep copy to the transaction.
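As a rough sketch of the intended contract (class and field names here are hypothetical, not the actual Pravega implementation): `get` must hand each transaction a copy of the committed object, never a reference to shared mutable state, so that a failed or uncommitted transaction cannot leak changes to others.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the deep-copy-per-transaction contract.
// Not the real BaseMetadataStore; illustrative only.
class MetadataStoreSketch {
    private final Map<String, HashMap<String, Long>> committed = new HashMap<>();

    void commit(String key, HashMap<String, Long> value) {
        committed.put(key, new HashMap<>(value)); // copy on commit
    }

    // get must copy the COMMITTED object for the caller's transaction;
    // returning the stored reference would let one transaction's
    // mutations bleed into others.
    HashMap<String, Long> get(String key) {
        HashMap<String, Long> stored = committed.get(key);
        return stored == null ? null : new HashMap<>(stored);
    }
}
```

With this contract, mutating the copy a transaction received leaves the committed state untouched until that transaction commits.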
Screenshots
Additional context
This issue was seen 3 out of 4 times in longevity tests.
The issue is triggered by an exception in the table segment (java.util.ConcurrentModificationException) that incorrectly leaves the data partially modified (instead of the update being atomic).
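For context, the ConcurrentModificationException in the trace further down (thrown from HashMap.putMapEntries while SegmentMetadataUpdateTransaction.getAttributes copies an attribute map) is the fail-fast behavior a HashMap exhibits when it is structurally modified while being iterated. A minimal, purely illustrative repro (not the actual Pravega code path):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// Minimal demo of how copying/iterating a HashMap that is mutated
// mid-iteration throws ConcurrentModificationException.
class CmeDemo {
    static boolean triggerCme() {
        Map<String, Integer> attributes = new HashMap<>();
        for (int i = 0; i < 16; i++) {
            attributes.put("attr" + i, i);
        }
        try {
            for (Map.Entry<String, Integer> e : attributes.entrySet()) {
                // Structural modification while the iterator is live:
                attributes.put("new-" + e.getKey(), 0);
            }
        } catch (ConcurrentModificationException ex) {
            return true; // fail-fast iterator detected the modification
        }
        return false;
    }
}
```

In the real failure, the mutation comes from another code path touching the same (incorrectly shared) metadata object, which is exactly what per-transaction deep copies are meant to prevent.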
Here is the sequence of events:
[1] SLTS receives a request to append at offset 141873. This results in two changes:
1 KV pair modifying the segment metadata to point to the new chunk.
1 KV pair adding the metadata for the new chunk.
Both keys are updated in a single tableStore.put() call with the expectation that the change is atomic: either both keys change at the same time or neither does.
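The all-or-nothing expectation on that multi-key put can be sketched as follows (hypothetical names, not the actual tableStore API): stage all entries first, then apply, so a failure partway through leaves the visible state untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the atomicity expectation for a multi-key put:
// either every entry becomes visible, or none does.
class AtomicPutSketch {
    private final Map<String, byte[]> table = new HashMap<>();

    // Stage into a private copy first; only swap in once every entry
    // has been applied successfully. A failure while staging leaves
    // `table` exactly as it was.
    void putAll(Map<String, byte[]> entries) {
        Map<String, byte[]> staged = new HashMap<>(table);
        staged.putAll(entries);
        table.clear();
        table.putAll(staged);
    }

    byte[] get(String key) {
        return table.get(key);
    }
}
```

The bug described in this issue violates exactly this expectation: the failed put left one key's change visible without the other.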
[2] The transaction fails because of a TableSegment error.
2021-03-13 14:27:35,318 66198355 [storage-io-75] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - exception op=1484422071, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873, length=1752.
2021-03-13 14:27:35,621 66198658 [core-14] ERROR i.p.s.server.writer.StorageWriter - StorageWriter[2]: Iteration[303221].Error.
io.pravega.segmentstore.storage.metadata.StorageMetadataException: Transaction failed
at io.pravega.segmentstore.storage.metadata.TableBasedMetadataStore.handleException(TableBasedMetadataStore.java:200)
at io.pravega.segmentstore.storage.metadata.TableBasedMetadataStore.lambda$writeAll$8(TableBasedMetadataStore.java:188)
at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
at io.pravega.common.concurrent.Futures.lambda$exceptionListener$5(Futures.java:277)
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
at io.pravega.segmentstore.server.logs.operations.CompletableOperation.fail(CompletableOperation.java:111)
at io.pravega.segmentstore.server.logs.OperationProcessor$QueueProcessingState.failOperation(OperationProcessor.java:670)
at io.pravega.segmentstore.server.logs.OperationProcessor.processOperations(OperationProcessor.java:316)
at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.ConcurrentModificationException: null
at java.base/java.util.HashMap$HashIterator.nextNode(Unknown Source)
at java.base/java.util.HashMap$EntryIterator.next(Unknown Source)
at java.base/java.util.HashMap$EntryIterator.next(Unknown Source)
at io.pravega.common.util.CollectionHelpers$ConvertedIterator.next(CollectionHelpers.java:271)
at com.google.common.collect.Iterators$ConcatenatedIterator.next(Iterators.java:1364)
at com.google.common.collect.Iterators$1.next(Iterators.java:141)
at java.base/java.util.HashMap.putMapEntries(Unknown Source)
at java.base/java.util.HashMap.<init>(Unknown Source)
at io.pravega.segmentstore.server.logs.SegmentMetadataUpdateTransaction.getAttributes(SegmentMetadataUpdateTransaction.java:148)
at io.pravega.segmentstore.server.logs.SegmentMetadataUpdateTransaction.<init>(SegmentMetadataUpdateTransaction.java:108)
at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.tryGetSegmentUpdateTransaction(ContainerMetadataUpdateTransaction.java:558)
at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.getSegmentUpdateTransaction(ContainerMetadataUpdateTransaction.java:520)
at io.pravega.segmentstore.server.logs.ContainerMetadataUpdateTransaction.preProcessOperation(ContainerMetadataUpdateTransaction.java:305)
at io.pravega.segmentstore.server.logs.OperationMetadataUpdater.preProcessOperation(OperationMetadataUpdater.java:224)
at io.pravega.segmentstore.server.logs.OperationProcessor.processOperation(OperationProcessor.java:387)
at io.pravega.segmentstore.server.logs.OperationProcessor.processOperations(OperationProcessor.java:311)
... 8 common frames omitted
[3] The write is retried, but it fails with an exception indicating that the segment metadata was already modified by the previous (failed) operation.
2021-03-13 14:27:37,116 66200153 [storage-io-52] DEBUG i.p.s.s.chunklayer.WriteOperation - ChunkedSegmentStorage[2] write - started op=1853561620, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873 length=1752.
2021-03-13 14:27:37,644 66200681 [core-20] ERROR i.p.s.server.writer.StorageWriter - StorageWriter[2]: Iteration[303222].Error.
io.pravega.segmentstore.contracts.BadOffsetException: [Segment 'longevity/smallScaleTx/519.#epoch.535'] Bad Offset. Expected 143625, given 141873.
at io.pravega.segmentstore.storage.chunklayer.WriteOperation.checkState(WriteOperation.java:304)
at io.pravega.segmentstore.storage.chunklayer.WriteOperation.lambda$call$4(WriteOperation.java:101)
at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
[4] However, when a read request is received, the table segment again returns incorrect metadata.
ReadOperation maps the read request to the last chunk based on the metadata. (This means the read index is not involved, because we already know it is the last chunk.)
But the metadata for the last chunk is missing, because the table segment never completed that transaction.
2021-03-13 14:27:39,124 66202161 [storage-io-178] DEBUG i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[1] read - chunk lookup - op=372890379, segment=_system/containers/storage_metadata_1, offset=1585435437, scanned=0, latency=2.
2021-03-13 14:27:39,125 66202162 [storage-io-178] DEBUG i.p.s.s.chunklayer.ReadOperation - ChunkedSegmentStorage[2] read - exception op=1189683104, segment=longevity/smallScaleTx/519.#epoch.535, offset=141873, bytesRead=0.
java.lang.IllegalStateException: chunkToReadFrom is null. currentChunkName=longevity/smallScaleTx/519.#epoch.535.E-2-O-141873.36f41341-a087-4082-a515-4545486e41ff Segment=longevity/smallScaleTx/519.#epoch.535
at com.google.common.base.Preconditions.checkState(Preconditions.java:824)
at io.pravega.segmentstore.storage.chunklayer.ReadOperation.lambda$findChunkForOffset$15(ReadOperation.java:260)
at java.base/java.util.concurrent.CompletableFuture$UniAccept.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$Completion.run(Unknown Source)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)