Improvements for dealing with eventually-consistent stores (S3) #437
Conversation
@pkj415 PTAL
Instead of compaction immediately deleting source index blobs, we now write log entries (with `m` prefix) which are merged on reads and applied only if the blob list includes all inputs and outputs; in that case the inputs are discarded, since they are known to have been superseded by the outputs. This addresses eventual consistency issues in stores such as S3, which don't guarantee list-after-put or list-after-delete. With such stores the repository is ultimately eventually consistent and there's not much that can be done about it, unless we use a second, strongly consistent store (such as GCS) for the index only.
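The "apply only if all inputs and outputs are visible" rule can be sketched as follows. This is a minimal illustration, not the actual kopia types; `compactionLogEntry` and `effectiveIndexBlobs` are hypothetical names:

```go
package main

import "fmt"

// compactionLogEntry mirrors the idea of an "m"-prefixed log entry:
// the inputs were merged into the outputs.
type compactionLogEntry struct {
	Inputs  []string
	Outputs []string
}

// effectiveIndexBlobs drops input blobs only for entries whose inputs
// AND outputs are all present in the listing; otherwise the entry is
// ignored, so an eventually-consistent listing can never hide data.
func effectiveIndexBlobs(listed []string, entries []compactionLogEntry) []string {
	present := map[string]bool{}
	for _, b := range listed {
		present[b] = true
	}
	superseded := map[string]bool{}
	for _, e := range entries {
		all := true
		for _, b := range append(append([]string{}, e.Inputs...), e.Outputs...) {
			if !present[b] {
				all = false
				break
			}
		}
		if all {
			for _, b := range e.Inputs {
				superseded[b] = true
			}
		}
	}
	var result []string
	for _, b := range listed {
		if !superseded[b] {
			result = append(result, b)
		}
	}
	return result
}

func main() {
	entry := compactionLogEntry{Inputs: []string{"n1", "n2"}, Outputs: []string{"n3"}}
	// All inputs and the output are visible: inputs are discarded.
	fmt.Println(effectiveIndexBlobs([]string{"n1", "n2", "n3"}, []compactionLogEntry{entry})) // [n3]
	// Output n3 not yet visible (EC listing): entry is ignored, inputs kept.
	fmt.Println(effectiveIndexBlobs([]string{"n1", "n2"}, []compactionLogEntry{entry})) // [n1 n2]
}
```

The key property is that ignoring a log entry is always safe: the reader simply keeps merging the (still present) inputs.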
This keeps track of which blobs (`n` and `m`) have been written by the local repository client, so that even if the storage listing is eventually consistent (as in S3), we get somewhat sane behavior. Note that this still assumes read-after-create semantics, which S3 also guarantees; otherwise it's very hard to do anything useful.
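The own-writes cache boils down to unioning the (possibly stale) storage listing with the names this client wrote itself. A minimal sketch with an illustrative `mergeOwnWrites` helper (not the actual kopia API):

```go
package main

import "fmt"

// mergeOwnWrites unions an eventually-consistent storage listing with the
// blob names this client wrote itself, so our own writes are always
// visible locally even before the listing catches up.
func mergeOwnWrites(listed, ownWrites []string) []string {
	seen := map[string]bool{}
	var result []string
	for _, b := range append(append([]string{}, listed...), ownWrites...) {
		if !seen[b] {
			seen[b] = true
			result = append(result, b)
		}
	}
	return result
}

func main() {
	// "n2" was just written locally but the listing does not show it yet.
	fmt.Println(mergeOwnWrites([]string{"n1"}, []string{"n2"})) // [n1 n2]
}
```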
- pass CachingOptions by pointer, since it's already pretty big
- split content.NewManager() into two functions
Clearing the cache requires closing the repository first, as Windows holds the files locked. This requires the ability to close the repository twice.
list() can now show recently deleted items.
This works by using N parallel "actors", each repeatedly performing operations on indexBlobManagers all sharing a single eventually consistent storage. Each actor runs in a loop and randomly selects between:

- *reading* all contents in indexes and verifying that they include all contents written by the actor so far and that contents are correctly marked as deleted
- *creating* new contents
- *deleting* one of the previously-created contents (by the same actor)
- *compacting* all index files into one

The test runs on accelerated time (every read of time moves it by 0.1 seconds) and simulates several hours of running. In case of a failure, the log should provide enough debugging information to trace the exact sequence of events leading up to the failure: each log line is prefixed with the actorID and all storage access is logged.
…BlobManagerStress
This is strictly better: it improves robustness when handling eventually consistent stores in many ways. It makes sense to merge this PR as is and address the remaining issues and comments in a separate PR.
@jkowalski WDYT?
repo/content/index_blob_manager.go
Outdated
```
	return results, nil
}

func (m *indexBlobManagerImpl) deleteOldIndexBlobs(ctx context.Context, latestBlob blob.Metadata) error {
```
Would maintenance runs include anything to delete old index blobs and compaction log entries?
We can add it, sure, but assuming we do semi-regular compaction, things will clean themselves up.
repo/content/index_blob_manager.go
Outdated
```
	return nil, errors.Wrap(err, "error merging local writes for compaction log entries")
}

storageIndexBlobs, err := m.listCache.listIndexBlobs(ctx, indexBlobPrefix)
```
Making two calls to listIndexBlobs with different prefixes one after the other is an issue here; consider this:
- As part of some process X, the first call to list compaction logs gave us the blobs existing.
- The process X then stalled for some reason for a long time (hours) before listing all index blobs.
- Another process Y came in and wrote a compaction log blob, compacting some index blobs say A and B.
- After some time, another process Z, removed index blobs A and B.
- Just after process Z removed A, process X resumed to read index blob B. This is an issue, as it did not read the compaction blob that was written earlier.
Maybe we should first list all index blobs, and then all compaction log blobs.
It's actually not needed. If you do compaction at time `t0`, it reads some index files (say `Na`, `Nb`, `Nc`) and creates a new blob `Nn` and a log entry `Mn` which says:

```
{ inputs: ["Na","Nb","Nc"], outputs: ["Nn"] }
```

We now have two cases depending on time:

a) between [t0, t0 + minIndexBlobDeleteAge): The eventual consistency for both `Nn` and `Mn` is at play - they may or may not be observed by readers. The original writer will see `Nn` and `Mn` due to the own-writes cache. Other readers will not necessarily see `Nn` and/or `Mn`, but they will for sure see `Na`, `Nb`, `Nc` because we have not deleted them yet. Ultimately the repository will read and merge the indexes and:

```
merge(Na, Nb, Nc) == merge(Na, Nb, Nc, Nn)
```

b) after t0 + minIndexBlobDeleteAge: During this time we are free to delete `Na`, `Nb`, `Nc` and `Mn`, but that's ok since we've waited long enough for `Nn` to be visible to all readers. The readers will see `Nn` and possibly some subset of {`Na`, `Nb`, `Nc`}, but that's ok since:

```
merge(Na, Nb, Nc, Nn) == merge(Nn)
merge(Na, Nb, Nn) == merge(Nn)
merge(Na, Nc, Nn) == merge(Nn)
merge(Nb, Nc, Nn) == merge(Nn)
merge(Na, Nn) == merge(Nn)
merge(Nb, Nn) == merge(Nn)
merge(Nc, Nn) == merge(Nn)
```
The race is that if we delete the compaction log too early, it may lead to previously deleted contents becoming temporarily live again to an outside observer. Added a test case that reproduces the issue; verified that it fails without the fix and passes with it.
I finally discovered what's wrong with the stress test - it boils down to unsafe concurrent compaction done by two actors. The sequence:
- better logging to be able to trace the root cause in case of a failure
- prevented concurrent compaction, which is unsafe. The sequence:

1. A creates contentA1 in INDEX-1
2. B creates contentB1 in INDEX-2
3. A deletes contentA1 in INDEX-3
4. B does compaction, but is not seeing INDEX-3 (due to EC or simply because B started the read before #3 completed), so it writes INDEX-4 == merge(INDEX-1, INDEX-2)
   * INDEX-4 has contentA1 as active
5. A does compaction but is not seeing INDEX-4 yet (due to EC or because the read started before #4), so it drops contentA1 and writes INDEX-5 = merge(INDEX-1, INDEX-2, INDEX-3)
   * INDEX-5 does not have contentA1
6. C sees INDEX-4 and INDEX-5, and merge(INDEX-4, INDEX-5) contains contentA1, which is wrong, because contentA1 has been deleted (and there's no record of it anywhere in the system)
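The resurrection in the last step can be reproduced in a toy model. The `overlay` helper and string states are illustrative; the essential point is that one compaction output dropped the tombstone while the other never saw it:

```go
package main

import "fmt"

// overlay merges maps left to right; later entries win. This stands in
// for index merging in this toy reproduction.
func overlay(ms ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, m := range ms {
		for k, v := range m {
			out[k] = v
		}
	}
	return out
}

func main() {
	index1 := map[string]string{"contentA1": "active"}
	index2 := map[string]string{"contentB1": "active"}
	index3 := map[string]string{"contentA1": "deleted"} // tombstone

	// B compacts without seeing INDEX-3: contentA1 stays "active".
	index4 := overlay(index1, index2)
	// A compacts seeing INDEX-3 and discards the tombstoned entry entirely.
	index5 := overlay(index1, index2, index3)
	delete(index5, "contentA1")

	// C merges both compaction outputs: contentA1 is live again.
	fmt.Println(overlay(index4, index5)["contentA1"]) // active
}
```

With concurrent compaction prevented, only one of the two conflicting outputs can exist, so this state is unreachable.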
I'm going to disable the automatic compaction that happens when the repository is first opened and move it to maintenance.
…ch time by adding 32 random bytes
… all index blob IDs are different
```
@@ -306,6 +300,12 @@ func (m *indexBlobManagerImpl) deleteOldBlobs(ctx context.Context, latestBlob bl
		return errors.Wrap(err, "unable to delete compaction logs")
	}

	compactionLogBlobsToDelayCleanup := m.findCompactionLogBlobsToDelayCleanup(ctx, compactionBlobs)
```
Just a minor nit: we don't need to write cleanup blobs for compaction blobs that already had a cleanup blob and will end up being deleted below (line 313)
LGTM ✅
🚢it
🚀
Fixes #326