
Improvements for dealing with eventually-consistent stores (S3) #437

Merged: 34 commits merged into kopia:master on Jun 1, 2020

Conversation

@jkowalski (Contributor) commented May 12, 2020

  • added compaction log which defers deletion of old indexes until later
  • added cache of local writes which is merged with backend List results

Fixes #326

@julio-lopez (Collaborator) commented:

@pkj415 PTAL

@jkowalski marked this pull request as draft May 18, 2020 19:49
Instead of compaction immediately deleting the source index blobs, we now
write log entries (with the `m` prefix) which are merged on reads
and applied only if the blob list includes all of their inputs and outputs; in
that case the inputs are discarded, since they are known to have been
superseded by the outputs.
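For illustration, a minimal sketch of that rule with made-up names (the actual kopia types differ):

```go
// A compaction log entry is applied only when the blob listing contains every
// input and every output it references; otherwise it is ignored.
package main

import "fmt"

type compactionLogEntry struct {
	InputBlobs  []string // index blobs that were compacted
	OutputBlobs []string // index blob(s) that superseded them
}

// applyCompactionLog drops the entry's inputs from the visible set, but only
// if all inputs and outputs are present in the listing.
func applyCompactionLog(visible map[string]bool, e compactionLogEntry) {
	for _, b := range append(append([]string(nil), e.InputBlobs...), e.OutputBlobs...) {
		if !visible[b] {
			return // incomplete view (eventual consistency) - keep the inputs
		}
	}

	for _, in := range e.InputBlobs {
		delete(visible, in) // inputs are known to be superseded by the outputs
	}
}

func main() {
	visible := map[string]bool{"n1": true, "n2": true, "n3": true}

	applyCompactionLog(visible, compactionLogEntry{
		InputBlobs:  []string{"n1", "n2"},
		OutputBlobs: []string{"n3"},
	})

	fmt.Println(visible) // map[n3:true]
}
```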

This addresses eventual consistency issues in stores such as S3,
which don't guarantee list-after-put or list-after-delete. With such
stores the repository is ultimately eventually consistent and there's
not much that can be done about it, unless we use a second, strongly
consistent store (such as GCS) for the index only.
This keeps track of which blobs (n and m) have been written by the
local repository client, so that even if the storage listing
is eventually consistent (as in S3), we get somewhat sane behavior.

Note that this still assumes read-after-create semantics, which
S3 also guarantees; otherwise it's very hard to do anything useful.
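For illustration, a minimal sketch of merging the own-writes cache with a backend listing (the type and names are hypothetical, not kopia's actual API):

```go
// Blobs written by the local client remain visible to it even before the
// eventually consistent backend lists them.
package main

import "fmt"

// blobMeta is a stand-in for the real blob metadata type.
type blobMeta struct {
	ID string
}

// mergeOwnWrites returns the backend listing plus any locally written blobs
// that the (possibly stale) listing does not include yet.
func mergeOwnWrites(listed, ownWrites []blobMeta) []blobMeta {
	seen := map[string]bool{}
	merged := append([]blobMeta(nil), listed...)

	for _, m := range listed {
		seen[m.ID] = true
	}

	for _, m := range ownWrites {
		if !seen[m.ID] {
			merged = append(merged, m) // backend hasn't surfaced it yet - trust our own write
		}
	}

	return merged
}

func main() {
	listed := []blobMeta{{ID: "n1"}}
	ownWrites := []blobMeta{{ID: "n2"}, {ID: "n1"}}

	fmt.Println(mergeOwnWrites(listed, ownWrites)) // [{n1} {n2}]
}
```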
- pass CachingOptions by pointer, since it's already pretty big
- split content.NewManager() into two functions
Clearing the cache requires closing the repository first, as Windows holds
the files locked.

This requires the ability to close the repository twice.
@jkowalski marked this pull request as ready for review May 22, 2020 03:34
This works by using N parallel "actors", each repeatedly performing
operations on indexBlobManagers that all share a single eventually
consistent storage.

Each actor runs in a loop and randomly selects between:

- *reading* all contents in indexes and verifying that it includes
  all contents written by the actor so far and that contents are
  correctly marked as deleted
- *creating* new contents
- *deleting* one of the previously-created contents (by the same actor)
- *compacting* all index files into one

The test runs on accelerated time (every read of time moves it by 0.1
seconds) and simulates several hours of running.

In case of a failure, the log should provide enough debugging
information to trace the exact sequence of events leading up to the
failure: each log line is prefixed with the actorID, and all storage
access is logged.
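For illustration, a sketch of the actor loop described above; the interface and method names are hypothetical, not those of the actual test:

```go
package stresstest

import (
	"context"
	"math/rand"
)

// actorOps abstracts the operations each actor performs against an
// indexBlobManager backed by a shared, eventually consistent storage.
type actorOps interface {
	VerifyAllWrittenContents(ctx context.Context) error // read + verify
	CreateContent(ctx context.Context) error
	DeleteRandomOwnContent(ctx context.Context, rnd *rand.Rand) error
	CompactAllIndexes(ctx context.Context) error
}

// runActor repeatedly picks one of read/create/delete/compact at random,
// stopping at the first error so the log can be used to trace the failure.
func runActor(ctx context.Context, ops actorOps, rnd *rand.Rand, iterations int) error {
	for i := 0; i < iterations; i++ {
		var err error

		switch rnd.Intn(4) {
		case 0:
			err = ops.VerifyAllWrittenContents(ctx)
		case 1:
			err = ops.CreateContent(ctx)
		case 2:
			err = ops.DeleteRandomOwnContent(ctx, rnd)
		case 3:
			err = ops.CompactAllIndexes(ctx)
		}

		if err != nil {
			return err
		}
	}

	return nil
}
```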
@julio-lopez (Collaborator) left a comment:

This is strictly better: it improves robustness when handling eventually consistent stores in many ways. It makes sense to merge this PR as-is and address the remaining issues and comments in a separate PR.

@jkowalski WDYT?

```go
	return results, nil
}

func (m *indexBlobManagerImpl) deleteOldIndexBlobs(ctx context.Context, latestBlob blob.Metadata) error {
```

Will maintenance runs include anything to delete old index blobs and compaction log entries?

@jkowalski (Contributor, Author) replied:

We can add it, sure, but assuming we do semi-regular compaction, things will clean themselves up.

```go
		return nil, errors.Wrap(err, "error merging local writes for compaction log entries")
	}

	storageIndexBlobs, err := m.listCache.listIndexBlobs(ctx, indexBlobPrefix)
```

Making two calls to listIndexBlobs with different prefixes, one after the other, is an issue here. Consider this:

  1. As part of some process X, the first call to list compaction logs gave us the existing blobs.
  2. Process X then stalled for some reason for a long time (hours) before listing all index blobs.
  3. Another process Y came in and wrote a compaction log blob, compacting some index blobs, say A and B.
  4. After some time, another process Z removed index blobs A and B.
  5. Just after process Z removed A, process X resumed and read index blob B. This is an issue, as it did not read the compaction log blob that was written earlier.

Maybe we should first list all index blobs, and then all compaction log blobs.
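For illustration, one hypothetical shape of that ordering (the interface and names below are made up, not kopia's actual API; the prefixes match the `n`/`m` convention described earlier):

```go
package indexlisting

import "context"

// blobLister is a stand-in for the storage/list-cache interface.
type blobLister interface {
	ListBlobs(ctx context.Context, prefix string) ([]string, error)
}

const (
	indexBlobPrefix         = "n" // regular index blobs
	compactionLogBlobPrefix = "m" // compaction log entries
)

// listIndexesThenCompactionLogs performs the two listings in the suggested
// order: index blobs first, compaction log blobs second.
func listIndexesThenCompactionLogs(ctx context.Context, l blobLister) (indexes, logs []string, err error) {
	indexes, err = l.ListBlobs(ctx, indexBlobPrefix)
	if err != nil {
		return nil, nil, err
	}

	logs, err = l.ListBlobs(ctx, compactionLogBlobPrefix)
	if err != nil {
		return nil, nil, err
	}

	return indexes, logs, nil
}
```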

Julio López and others added 3 commits May 28, 2020 12:40
It's actually not needed:

If you do compaction at time `t0`, it reads some index files (say `Na`, `Nb`, `Nc`)
and creates a new blob `Nn` and a log entry `Mn` which says:

```
{
    inputs: ["Na","Nb","Nc"],
    outputs: ["Nn"]
}
```

We now have two cases depending on time:

a) between [t0,t0 + minIndexBlobDeleteAge):

The eventual consistency for both `Nn` and `Mn` is at play - they may or may not be observed by readers.

The original writer will see `Nn` and `Mn` due to own-writes cache.

Other readers will not necessarily see `Nn` and/or `Mn` but they will for sure see `Na`, `Nb`, `Nc` because we have not deleted them yet.

Ultimately the repository will read and merge the indexes, and:

```
merge(Na, Nb, Nc) == merge(Na,Nb,Nc,Nn)
```

b) after t0 + minIndexBlobDeleteAge

During this time we are free to delete `Na`, `Nb`, `Nc`, and `Mn`,
but that's OK since we've waited long enough for `Nn` to be visible to all readers.

The readers will see `Nn` and possibly some subset of {`Na`, `Nb`, `Nc`}, but that's ok since

```
merge(Na, Nb, Nc, Nn) == merge(Nn)
merge(Na, Nb, Nn) == merge(Nn)
merge(Na, Nc, Nn) == merge(Nn)
merge(Nb, Nc, Nn) == merge(Nn)
merge(Na, Nn) == merge(Nn)
merge(Nb, Nn) == merge(Nn)
merge(Nc, Nn) == merge(Nn)
```
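For illustration, a minimal sketch of the age gate implied above (the constant name and value are illustrative, not kopia's actual setting):

```go
package agecheck

import "time"

// minIndexBlobDeleteAge is the waiting period after a compaction during which
// its inputs (and the compaction log entry) must not be deleted.
const minIndexBlobDeleteAge = 1 * time.Hour // illustrative value

// canDeleteCompactedInputs reports whether the inputs of a compaction that
// completed at compactedAt may be deleted at time now, i.e. whether enough
// time has passed for the output blob to be visible to all readers.
func canDeleteCompactedInputs(compactedAt, now time.Time) bool {
	return now.Sub(compactedAt) >= minIndexBlobDeleteAge
}
```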
The race is that if we delete the compaction log too early, previously
deleted contents may temporarily become live again to an outside
observer.

Added a test case that reproduces the issue; verified that it fails
without the fix and passes with it.
@jkowalski (Contributor, Author) commented May 31, 2020

I finally discovered what's wrong with the stress test - it boils down to unsafe concurrent compaction done by two actors:

The sequence:

  1. A creates contentA1 in INDEX-1
  2. B creates contentB1 in INDEX-2
  3. A deletes contentA1 in INDEX-3
  4. B does compaction, but is not seeing INDEX-3 (due to EC), so it writes INDEX-4==merge(INDEX-1,INDEX-2)
    INDEX-4 has contentA1 as active
  5. A does compaction but it's not seeing INDEX-4 yet (due to EC), so it drops contentA1, writes INDEX-5=merge(INDEX-1,INDEX-2,INDEX-3)
    INDEX-5 does not have contentA1
  6. C sees INDEX-4 and INDEX-5, and merge(INDEX-4,INDEX-5) contains contentA1, which is wrong because contentA1 has been deleted (and there's no record of the deletion anywhere in the system)

- better logging to be able to trace the root cause in case of a failure
- prevented concurrent compaction which is unsafe:

The sequence:

1. A creates contentA1 in INDEX-1
2. B creates contentB1 in INDEX-2
3. A deletes contentA1 in INDEX-3
4. B does compaction, but is not seeing INDEX-3 (due to EC or simply
   because B started read before #3 completed), so it writes
   INDEX-4==merge(INDEX-1,INDEX-2)
   * INDEX-4 has contentA1 as active
5. A does compaction but it's not seeing INDEX-4 yet (due to EC
   or because read started before #4), so it drops contentA1, writes
   INDEX-5=merge(INDEX-1,INDEX-2,INDEX-3)
   * INDEX-5 does not have contentA1
6. C sees INDEX-4 and INDEX-5, and merge(INDEX-4,INDEX-5)
   contains contentA1, which is wrong because contentA1 has been deleted
   (and there's no record of the deletion anywhere in the system)
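For illustration, a toy sketch of why the sequence above is dangerous (this is not kopia's index format or merge rule): once the deletion marker for contentA1 has been compacted away in INDEX-5, nothing in the merged view suppresses the stale "active" entry carried in INDEX-4.

```go
package racedemo

// indexEntry is a toy per-content record; Deleted marks a tombstone.
type indexEntry struct {
	ContentID string
	Deleted   bool
}

// mergeIndexes merges indexes with a tombstone-wins rule. If the tombstone is
// missing entirely (as in INDEX-5 after the unsafe concurrent compaction),
// the stale active entry from INDEX-4 survives and the content appears live.
func mergeIndexes(indexes ...[]indexEntry) map[string]indexEntry {
	merged := map[string]indexEntry{}

	for _, idx := range indexes {
		for _, e := range idx {
			prev, ok := merged[e.ContentID]
			if !ok || (e.Deleted && !prev.Deleted) {
				merged[e.ContentID] = e
			}
		}
	}

	return merged
}
```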
@jkowalski (Contributor, Author):

I'm going to disable the automatic compaction that happens when the repository is first opened and move it to maintenance.

```
@@ -306,6 +300,12 @@ func (m *indexBlobManagerImpl) deleteOldBlobs(ctx context.Context, latestBlob bl
		return errors.Wrap(err, "unable to delete compaction logs")
	}

	compactionLogBlobsToDelayCleanup := m.findCompactionLogBlobsToDelayCleanup(ctx, compactionBlobs)
```
A collaborator commented:

Just a minor nit: we don't need to write cleanup blobs for compaction blobs that already had a cleanup blob and will end up being deleted below (line 313)

@julio-lopez (Collaborator) left a comment:

LGTM ✅
🚢it :shipit:
🚀

@jkowalski merged commit d68273a into kopia:master on Jun 1, 2020
Successfully merging this pull request may close these issues.

Add testing for eventual consistency behavior of blob storage providers