Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indexheader: keep pre-shutdown snapshot between restarts #8281

Merged

Conversation

narqo
Copy link
Contributor

@narqo narqo commented Jun 5, 2024

What this PR does

The PR updates indexheader in a way so ReaderPool preserved the blocks found in the pre-shutdown snapshot. That it, when ReaderPool persists its own set of blocks to lazy-loaded.json, it also amends the set with the blocks, it eagerly loaded (or still loading) on startup. The changes make sure we only persist "fresh" blocks, so idle blocks will be evicted from the snapshots, eventually.

Which issue(s) this PR fixes or relates to

Fixes #8255

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@narqo narqo requested a review from a team as a code owner June 5, 2024 12:54
@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from cbe2d95 to 63c1579 Compare June 5, 2024 12:57
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure sure about solving the problem this way. Have you considered guaranteeing that we never write the snapshot file while the store-gateway is starting up?

@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from 63c1579 to 486c033 Compare June 7, 2024 12:48
@narqo
Copy link
Contributor Author

narqo commented Jun 7, 2024

Have you considered guaranteeing that we never write the snapshot file while the store-gateway is starting up?

@pracucci @dimitarvdimitrov I've sketched how this can look like, following the discussion with Marco.

As noted, the primary challenge here is that BucketStore doesn't know, in general, if the ReaderPool it created needs sharpshoting (because ReaderPool can be not-lazy). Extracting snapshotter into BucketStore to restore the snapshots and spawn snapshotting after the initial sync was finished, breaks the existing abstractions. This isn't the end of world, imho, but doesn't make the package "clearer" either.

Also note, the tests are currently broken, since I didn't want to spend time on that, before validating the idea further.

@narqo narqo requested a review from pracucci June 7, 2024 12:57
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted, the primary challenge here is that BucketStore doesn't know, in general, if the ReaderPool it created needs sharpshoting (because ReaderPool can be not-lazy).

did I read the code right that now we just persist the list of blocks regardless whether they are lazy-loaded or not? I'm fine with this approach

pkg/storegateway/indexheader/snapshotter.go Outdated Show resolved Hide resolved
pkg/storegateway/indexheader/snapshotter.go Outdated Show resolved Hide resolved
@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from 486c033 to 925aa09 Compare June 12, 2024 09:19
@narqo
Copy link
Contributor Author

narqo commented Jun 12, 2024

@dimitarvdimitrov

did I read the code right that now we just persist the list of blocks regardless whether they are lazy-loaded or not?

Yes, BucketStore always starts its snapshotter, asking to persist the blocks of its index pool. Importantly, it only loads the previous snapshot on startup when index_header.eager_loading_startup_enabled is enabled.

@dimitarvdimitrov
Copy link
Contributor

Yes, BucketStore always starts its snapshotter, asking to persist the blocks of its index pool. Importantly, it only loads the previous snapshot on startup when index_header.eager_loading_startup_enabled is enabled.

this means that now if you enable lazy loading, the store-gateway will keep the blocks loaded from when lazy loading was disabled. I think it's worth mentioning this in the CLI flag for lazy loading. What do you think?

@dimitarvdimitrov
Copy link
Contributor

Yes, BucketStore always starts its snapshotter, asking to persist the blocks of its index pool. Importantly, it only loads the previous snapshot on startup when index_header.eager_loading_startup_enabled is enabled.

this means that now if you enable lazy loading, the store-gateway will keep the blocks loaded from when lazy loading was disabled. I think it's worth mentioning this in the CLI flag for lazy loading. What do you think?

I was wrong. We only persist the lazy loaded blocks. So upon a restart we'd just start with no eagerly loaded blocks, which is what happens today too.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're missing the case where a store-gateway discovers a tenant after starting up. Also that leaking goroutine isn't great, I'm not sure how hard it would be to solve it.

Comment on lines 246 to 253
s.indexHeadersSnapshotter = indexheader.NewSnapshotter(s.logger, snapConfig)

var lazyLoadedBlocks map[ulid.ULID]int64
if bucketStoreConfig.IndexHeader.EagerLoadingStartupEnabled {
lazyLoadedBlocks = s.indexHeadersSnapshotter.RestoreLoadedBlocks()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's starting to feel like the ReaderPool should be a service on its own with its own lifecycle (start, run, stop). This way we're leaking this to other components like the BucketStore.

But just now I saw that BucketStore and BucketStores doesn't manage their own lifecycle either (nothing calls Stop on them for example), so doing this just for ReaderPool probably won't bring too much value as-is

That said, I think it's nothing actionable unfortuantely

Comment on lines 371 to 373
if err := s.indexHeadersSnapshotter.Start(ctx, s.indexReaderPool); err != nil {
return errors.Wrap(err, "start index headers snapshotter")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is called in InitialSync which runs when the store-gateway starts up. What would happen if the tenant was discovered after the store-gateway completed its initial sync?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, I didn't think about that. After judo'ing with the code for a bit, I didn't manage to find a better way to resolve that than with moving the call to snapshot.Start inside sync. Please, have look through the latest changes.

Comment on lines +31 to +33
// Similar to the one above, store-gateway BucketStore starts a goroutine in snapshotter
// that manages lazy-loaded snapshots. It stops the snapshotter on BucketStore close.
goleak.IgnoreTopFunction("github.com/grafana/mimir/pkg/storegateway/indexheader.(*Snapshotter).Start.func1"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The store is closed only when the tenant no longer shards to this store-gateway, which happens fairly rarely. This pushes me more towards having a proper lifecycle. That goroutine doesn't have a good reason to leak. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to have a proper lifecycle in BucketStore (and maybe BucketStores) so they managed the lifecycles of ReaderPool and Snapshotter? As you mentioned in #8281 (comment) (and what I'd bumped into when looked at it), this might be too large of a change for the expected gain.

I'm not opposed to doing that, though — refactoring can be fun 🕺

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep that was my idea. Make BucketStores maintain the lifecycle of the BucketStore instances that it spawns - starting them when they come up, shutting them down when the store-gateway process shuts down. And then doing the appropriate things on these triggers (like starting the snapshotter; maybe even doing the initial sync as part of starting, which would make it slightly cleaner to use the lazy loaded snapshot -- for that we might bring back the snapshotter under ReaderPool). Maybe we can start simple and just managing the snapshotter and readerPool from BucketStore (and transitively from BucketStores)?

But I agree it's some refactoring. I'm happy to merge this PR so we catch the 2.13 and r294 releases on Monday, as long as we figure out how to make it work for tenants which are discovered (the comment above). The rest is just code structure.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing my feedback.

@pracucci do you want to take another look or can we merge this?

BucketStore lifecycle refactor

I'll be somewhat busy with the 2.13 release this week, so I am not sure about when I'll be able to start with the bucket store start/stop lifecycle refactor. In case you can start this earlier I opened this issue #8389 for wheover gets to it first

CHANGELOG.md Outdated Show resolved Hide resolved
@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from 7d8bb9b to 69ab5d6 Compare June 17, 2024 12:42
@pracucci
Copy link
Collaborator

@pracucci do you want to take another look or can we merge this?

I'm taking a quick look.

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Snapshotter design LGTM, and it should fix the issue we want to fix.

It's important that the snapshot filepath hasn't changed to keep it backward compatible (we want to reload the previous file after rolling out this change), but I think it didn't change, so LGTM.

I left a couple of comments. One is about the design in reader pool, which I think should be re-iterated on too, but it's definitely not a blocker for this PR. The other comment is about what happens if we fail to start the snapshotter. I'm not super convinced about how we handle it.

Not explicitly approving this PR because I haven't done a super accurate review. I defer to Dimitar on this.

logger log.Logger
conf SnapshotterConfig

stop chan struct{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't Snapshotter a services.Service? I think most of Snapshotter.Start() could go away simply using a timer service.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the problem is that you need to pass blocksLoader to Start(), which is something you can't do with services.Service. Having to pass blocksLoader is a design bad smell to me. The bad smell originates from the fact that we have to pass loadedBlocks to NewReaderPool().

Reader pool should be unaware of preShutdownLoadedBlocks. The loading of previously loaded blocks should triggered outside of the reader pool, and not in the reader pool itself.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't address this comment in this PR, but I think we should re-iterate on the design to further clean it up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we can use #8389 as a venue to address some of this

// We do that here so the snapshotter watched after blocks from both initial sync and those discovered later.
var err error
s.snapshotterStartOnce.Do(func() {
err = s.snapshotter.Start(ctx, s.indexReaderPool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Start() fails, then the snapshotter will never run. Is it what we want?

Copy link
Contributor Author

@narqo narqo Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not trying to start a once failed snapshotter seems like a safe play:

  • for an initial sync this will propagate into fail of the service
  • for a block discovered at later stages, this will prevent saving the lazy-loaded snapshots for the block (and a log message)

In contrast, I'm not sure that I'd expect that snapshotter.Start can be called more than once per instance from a code-reader's POV.

I don't feel strong about that, though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've chatted about this one with @dimitarvdimitrov, and decided it's not worth it failing the Snapshotter.Start. The previous version never did it also. I've updated the code.

pkg/storegateway/indexheader/snapshotter.go Outdated Show resolved Hide resolved
@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from e1114e3 to 968e334 Compare June 18, 2024 12:08
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
narqo and others added 4 commits June 18, 2024 14:45
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Signed-off-by: Vladimir Varankin <vladimir.varankin@grafana.com>
Co-authored-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@narqo narqo force-pushed the vldmr/indexheader-keep-snapshot-between-restarts branch from 5ff5ffb to 8ce3f22 Compare June 18, 2024 12:45
@dimitarvdimitrov dimitarvdimitrov merged commit 865d165 into main Jun 19, 2024
29 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the vldmr/indexheader-keep-snapshot-between-restarts branch June 19, 2024 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store-gateway overwrites the lazy-loaded.json snapshot while it's still loading blocks at startup
3 participants