Improvements for dealing with eventually-consistent stores (S3) #437
Conversation
@pkj415 PTAL
Instead of compaction immediately deleting source index blobs, we now write log entries (with `m` prefix) which are merged on reads and applied only if the blob list includes all inputs and outputs; in that case the inputs are discarded, since they are known to have been superseded by the outputs. This addresses eventual consistency issues in stores such as S3, which don't guarantee list-after-put or list-after-delete. With such stores the repository is ultimately eventually consistent and there's not much that can be done about it, unless we use a second, strongly consistent store (such as GCS) for the index only.
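The "apply only if all inputs and outputs are visible" rule can be sketched as follows. This is a minimal illustration, not the actual kopia types; `compactionLogEntry` and `effectiveIndexBlobs` are hypothetical names:

```go
package main

import "fmt"

// compactionLogEntry mirrors the idea of an "m"-prefixed log entry:
// the inputs were merged into the outputs.
type compactionLogEntry struct {
	Inputs  []string
	Outputs []string
}

// effectiveIndexBlobs drops input blobs only for entries whose inputs
// AND outputs are all present in the listing; otherwise the entry is
// ignored, so an eventually-consistent listing can never hide data.
func effectiveIndexBlobs(listed []string, entries []compactionLogEntry) []string {
	present := map[string]bool{}
	for _, b := range listed {
		present[b] = true
	}
	superseded := map[string]bool{}
	for _, e := range entries {
		all := true
		for _, b := range append(append([]string{}, e.Inputs...), e.Outputs...) {
			if !present[b] {
				all = false
				break
			}
		}
		if all {
			for _, b := range e.Inputs {
				superseded[b] = true
			}
		}
	}
	var result []string
	for _, b := range listed {
		if !superseded[b] {
			result = append(result, b)
		}
	}
	return result
}

func main() {
	entry := compactionLogEntry{Inputs: []string{"n1", "n2"}, Outputs: []string{"n3"}}
	// All inputs and the output are visible: inputs are discarded.
	fmt.Println(effectiveIndexBlobs([]string{"n1", "n2", "n3"}, []compactionLogEntry{entry})) // [n3]
	// Output n3 not yet visible (EC listing): entry is ignored, inputs kept.
	fmt.Println(effectiveIndexBlobs([]string{"n1", "n2"}, []compactionLogEntry{entry})) // [n1 n2]
}
```

The key property is that ignoring a log entry is always safe: the reader simply keeps merging the (still present) inputs.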
This keeps track of which blobs (`n` and `m`) have been written by the local repository client, so that even if the storage listing is eventually consistent (as in S3), we get somewhat sane behavior. Note that this still assumes read-after-create semantics, which S3 also guarantees; otherwise it's very hard to do anything useful.
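The own-writes cache boils down to unioning the (possibly stale) storage listing with the names this client wrote itself. A minimal sketch with an illustrative `mergeOwnWrites` helper (not the actual kopia API):

```go
package main

import "fmt"

// mergeOwnWrites unions an eventually-consistent storage listing with the
// blob names this client wrote itself, so our own writes are always
// visible locally even before the listing catches up.
func mergeOwnWrites(listed, ownWrites []string) []string {
	seen := map[string]bool{}
	var result []string
	for _, b := range append(append([]string{}, listed...), ownWrites...) {
		if !seen[b] {
			seen[b] = true
			result = append(result, b)
		}
	}
	return result
}

func main() {
	// "n2" was just written locally but the listing does not show it yet.
	fmt.Println(mergeOwnWrites([]string{"n1"}, []string{"n2"})) // [n1 n2]
}
```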
- pass CachingOptions by pointer, since it's already pretty big
- split content.NewManager() into two functions
Clearing the cache requires closing the repository first, as Windows holds the files locked. This requires the ability to close the repository twice.
list() can now show recently deleted items.
This works by using N parallel "actors", each repeatedly performing operations on indexBlobManagers all sharing a single eventually consistent storage. Each actor runs in a loop and randomly selects between:

- *reading* all contents in indexes and verifying that they include all contents written by the actor so far and that contents are correctly marked as deleted
- *creating* new contents
- *deleting* one of the previously-created contents (by the same actor)
- *compacting* all index files into one

The test runs on accelerated time (every read of time moves it by 0.1 seconds) and simulates several hours of running. In case of a failure, the log should provide enough debugging information to trace the exact sequence of events leading up to the failure: each log line is prefixed with the actorID and all storage access is logged.
…BlobManagerStress
This is strictly better: it improves robustness when handling eventually consistent stores in many ways. It makes sense to merge this PR as is and address the remaining issues and comments in a separate PR.
@jkowalski WDYT?
repo/content/index_blob_manager.go
Outdated
```
	return results, nil
}

func (m *indexBlobManagerImpl) deleteOldIndexBlobs(ctx context.Context, latestBlob blob.Metadata) error {
```
Would maintenance runs include anything to delete old index blobs and compaction log entries?
We can add it, sure, but assuming we do semi-regular compaction, things will clean themselves up.
repo/content/index_blob_manager.go
Outdated
```
	return nil, errors.Wrap(err, "error merging local writes for compaction log entries")
}

storageIndexBlobs, err := m.listCache.listIndexBlobs(ctx, indexBlobPrefix)
```
Making two calls to listIndexBlobs with different prefixes one after the other is an issue here; consider this:
- As part of some process X, the first call to list compaction logs gave us the blobs existing.
- The process X then stalled for some reason for a long time (hours) before listing all index blobs.
- Another process Y came in and wrote a compaction log blob, compacting some index blobs say A and B.
- After some time, another process Z, removed index blobs A and B.
- Just after process Z removed A, process X resumed to read index blob B. This is an issue, as it did not read the compaction blob that was written earlier.
Maybe we should first list all index blobs, and then all compaction log blobs.
It's actually not needed. If you do compaction at time `t0`, it reads some index files (say `Na`, `Nb`, `Nc`) and creates a new blob `Nn` and a log entry `Mn` which says:

```
{ inputs: ["Na","Nb","Nc"], outputs: ["Nn"] }
```

We now have two cases depending on time:

a) between [t0, t0 + minIndexBlobDeleteAge): The eventual consistency for both `Nn` and `Mn` is at play - they may or may not be observed by readers. The original writer will see `Nn` and `Mn` due to the own-writes cache. Other readers will not necessarily see `Nn` and/or `Mn`, but they will for sure see `Na`, `Nb`, `Nc` because we have not deleted them yet. Ultimately the repository will read and merge the indexes and:

```
merge(Na, Nb, Nc) == merge(Na, Nb, Nc, Nn)
```

b) after t0 + minIndexBlobDeleteAge: During this time we are free to delete `Na`, `Nb`, `Nc` and `Mn`, but that's ok since we've waited long enough for `Nn` to be visible to all readers. The readers will see `Nn` and possibly some subset of {`Na`, `Nb`, `Nc`}, but that's ok since:

```
merge(Na, Nb, Nc, Nn) == merge(Nn)
merge(Na, Nb, Nn) == merge(Nn)
merge(Na, Nc, Nn) == merge(Nn)
merge(Nb, Nc, Nn) == merge(Nn)
merge(Na, Nn) == merge(Nn)
merge(Nb, Nn) == merge(Nn)
merge(Nc, Nn) == merge(Nn)
```
The race is that if we delete the compaction log too early, it may lead to previously deleted contents becoming temporarily live again to an outside observer. Added a test case that reproduces the issue; verified that it fails without the fix and passes with it.
I finally discovered what's wrong with the stress test - it boils down to unsafe concurrent compaction done by two actors. The sequence:
- better logging to be able to trace the root cause in case of a failure
- prevented concurrent compaction, which is unsafe. The sequence:

1. A creates contentA1 in INDEX-1
2. B creates contentB1 in INDEX-2
3. A deletes contentA1 in INDEX-3
4. B does compaction, but is not seeing INDEX-3 (due to EC or simply because B started the read before #3 completed), so it writes INDEX-4 == merge(INDEX-1, INDEX-2)
   * INDEX-4 has contentA1 as active
5. A does compaction but is not seeing INDEX-4 yet (due to EC or because the read started before #4), so it drops contentA1 and writes INDEX-5 = merge(INDEX-1, INDEX-2, INDEX-3)
   * INDEX-5 does not have contentA1
6. C sees INDEX-4 and INDEX-5, and merge(INDEX-4, INDEX-5) contains contentA1, which is wrong, because contentA1 has been deleted (and there's no record of it anywhere in the system)
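The resurrection in the last step can be reproduced in a toy model. The `overlay` helper and string states are illustrative; the essential point is that one compaction output dropped the tombstone while the other never saw it:

```go
package main

import "fmt"

// overlay merges maps left to right; later entries win. This stands in
// for index merging in this toy reproduction.
func overlay(ms ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, m := range ms {
		for k, v := range m {
			out[k] = v
		}
	}
	return out
}

func main() {
	index1 := map[string]string{"contentA1": "active"}
	index2 := map[string]string{"contentB1": "active"}
	index3 := map[string]string{"contentA1": "deleted"} // tombstone

	// B compacts without seeing INDEX-3: contentA1 stays "active".
	index4 := overlay(index1, index2)
	// A compacts seeing INDEX-3 and discards the tombstoned entry entirely.
	index5 := overlay(index1, index2, index3)
	delete(index5, "contentA1")

	// C merges both compaction outputs: contentA1 is live again.
	fmt.Println(overlay(index4, index5)["contentA1"]) // active
}
```

With concurrent compaction prevented, only one of the two conflicting outputs can exist, so this state is unreachable.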
I'm going to disable the automatic compaction that happens when the repository is first opened and move it to maintenance.
…ch time by adding 32 random bytes
… all index blob IDs are different
```
@@ -306,6 +300,12 @@ func (m *indexBlobManagerImpl) deleteOldBlobs(ctx context.Context, latestBlob bl
		return errors.Wrap(err, "unable to delete compaction logs")
	}

	compactionLogBlobsToDelayCleanup := m.findCompactionLogBlobsToDelayCleanup(ctx, compactionBlobs)
```
Just a minor nit: we don't need to write cleanup blobs for compaction blobs that already had a cleanup blob and will end up being deleted below (line 313)
LGTM ✅
🚢it
🚀
Fixes #326