Store-gateway drops all blocks if fails to heartbeat the ring #1805

Closed
pracucci opened this issue May 3, 2022 · 1 comment · Fixed by #1816 or #1823
Labels: bug (Something isn't working), component/store-gateway


pracucci commented May 3, 2022

This issue describes a problem we experienced yesterday on some store-gateways in a Mimir cluster.

Scenario:

  • Mimir is configured to run the ring on Consul (but the issue could happen with any ring backend, memberlist included)
  • The store-gateway is overloaded while lazy-loading a large number of big blocks' index-headers
  • The store-gateway fails to heartbeat the ring for X consecutive minutes, where X > -store-gateway.sharding-ring.heartbeat-timeout
  • At the 1st next sync, the store-gateway detects itself as unhealthy and drops all its blocks
  • At the 2nd next sync, the store-gateway detects itself as healthy and re-adds all its blocks, causing additional load on the store-gateway itself

Zooming into a specific store-gateway

The issue described above happened to several store-gateways in a large cluster.
To better understand the behaviour, I'm looking at a specific one: store-gateway-zone-b-3.

The sequence of related block syncs is:

level=info ts=2022-05-02T09:06:19.395846082Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:07:03.365919167Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

# In this sync all blocks have been dropped.
level=info ts=2022-05-02T09:07:03.3659588Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:08:55.148949647Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

# In this sync all blocks have been re-added. The sync took 37 minutes.
level=info ts=2022-05-02T09:08:55.148992704Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.826814806Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

level=info ts=2022-05-02T09:45:34.826872774Z caller=gateway.go:288 msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2022-05-02T09:45:34.853222892Z caller=gateway.go:294 msg="successfully synchronized TSDB blocks for all users" reason=ring-change

Querying the metric cortex_consul_request_duration_seconds_count{namespace="REDACTED",pod="store-gateway-zone-b-3",operation="CAS",status_code="200"} we can see that the number of successful CAS operations did not increase between 09:03:15 and 09:08:30. The metric value (when scraping succeeded at all, since the store-gateway was overloaded) is constantly 2369 during that timeframe:

[Screenshot 2022-05-03 at 10:17:04: the successful CAS counter stays flat at 2369 during that timeframe]

The store-gateway is running with -store-gateway.sharding-ring.heartbeat-timeout=4m and its last successful heartbeat was before 09:03:00, so by 09:07:00 the heartbeat had expired and the instance was unhealthy in the ring. The next sync started at 09:07:03, when the store-gateway was already unhealthy in the ring.
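To make the timing concrete, the health check effectively compares the instance's last heartbeat timestamp against the configured timeout. A minimal sketch of that comparison using the timestamps above (the function name and signature are illustrative, not the actual ring code):

package main

import (
	"fmt"
	"time"
)

// isHeartbeatHealthy is a hypothetical sketch of the check the ring applies:
// an instance whose last heartbeat is older than the heartbeat timeout is
// treated as unhealthy. Names are illustrative, not the real ring API.
func isHeartbeatHealthy(lastHeartbeat, now time.Time, heartbeatTimeout time.Duration) bool {
	return now.Sub(lastHeartbeat) <= heartbeatTimeout
}

func main() {
	lastHeartbeat := time.Date(2022, 5, 2, 9, 2, 45, 0, time.UTC) // last successful CAS, shortly before 09:03:00
	syncStart := time.Date(2022, 5, 2, 9, 7, 3, 0, time.UTC)      // the sync that dropped all blocks

	// With -store-gateway.sharding-ring.heartbeat-timeout=4m, the instance is
	// already considered unhealthy when the 09:07:03 sync runs.
	fmt.Println(isHeartbeatHealthy(lastHeartbeat, syncStart, 4*time.Minute)) // false
}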

The store-gateway has a protection to keep the previously loaded blocks in case it's unable to look up the ring. In this case, however, it did successfully look up the ring, but its own instance was detected as unhealthy:

if err != nil {
	if _, ok := loaded[blockID]; ok {
		level.Warn(logger).Log("msg", "failed to check block owner but block is kept because was previously loaded", "block", blockID.String(), "err", err)
	} else {
		level.Warn(logger).Log("msg", "failed to check block owner and block has been excluded because was not previously loaded", "block", blockID.String(), "err", err)

		// Skip the block.
		synced.WithLabelValues(shardExcludedMeta).Inc()
		delete(metas, blockID)
	}

	continue
}

The store-gateway also has a protection to keep the previously loaded blocks if there's no other authoritative owner for a block in the ring, but that wasn't the case either, because there were other available owners:

// The block is not owned by the store-gateway. However, if it's currently loaded
// we can safely unload it only once at least 1 authoritative owner is available
// for queries.
if _, ok := loaded[blockID]; ok {
	// The ring Get() returns an error if there's no available instance.
	if _, err := r.Get(key, BlocksOwnerRead, bufDescs, bufHosts, bufZones); err != nil {
		// Keep the block.
		continue
	}
}

// The block is not owned by the store-gateway and there's at least 1 available
// authoritative owner available for queries, so we can filter it out (and unload
// it if it was loaded).
synced.WithLabelValues(shardExcludedMeta).Inc()
delete(metas, blockID)
So all the blocks have been unloaded, and then (at the 2nd next sync) progressively loaded back.

Proposal

I propose to add a simple additional protection to the store-gateway so that it doesn't drop any loaded block while the store-gateway itself is unhealthy in the ring.

The idea is that if the store-gateway detects itself as unhealthy in the ring, it shouldn't do anything during the periodic sync, neither add nor drop blocks, because it's in an inconsistent situation: the store-gateway is obviously running, but it's reported as unhealthy in the ring.
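A minimal sketch of what this protection could look like, assuming a simplified stand-in for the sharding filter and an abstracted health lookup (all names and the structure are illustrative, not the actual Mimir code):

package main

import "fmt"

// BlockFilter is a hypothetical, simplified stand-in for the store-gateway's
// sharding filter; names and structure are illustrative, not the actual Mimir code.
type BlockFilter struct {
	// selfHealthy reports whether this instance's own heartbeat/state is still
	// valid in the ring (an assumed lookup, abstracted away here).
	selfHealthy func() bool
}

// Filter returns the set of blocks to keep. The proposed protection: if the
// instance is unhealthy in the ring, return the currently loaded blocks
// unchanged, so the sync neither adds nor drops anything.
func (f *BlockFilter) Filter(discovered, loaded map[string]struct{}) map[string]struct{} {
	if !f.selfHealthy() {
		return loaded
	}

	// Normal path: the per-block ownership checks against the ring would go here
	// (see the real snippets above); this sketch just keeps everything discovered.
	return discovered
}

func main() {
	f := &BlockFilter{selfHealthy: func() bool { return false }}

	loaded := map[string]struct{}{"block-A": {}, "block-B": {}}
	discovered := map[string]struct{}{} // the ring says we own nothing while unhealthy

	// With the protection, an unhealthy instance keeps its previously loaded blocks.
	fmt.Println(len(f.Filter(discovered, loaded))) // 2
}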


pracucci commented May 3, 2022

My plan for trying out the proposed enhancement:

  • Reproduce locally
  • Fix it
  • Try again locally
  • Write unit tests
