
Ruler, Querier, AM: Increase frequency of gRPC healthchecks #3168

Merged

Conversation

@dimitarvdimitrov (Contributor) commented Oct 7, 2022

This increases the frequency of gRPC client healthchecks. These healthchecks run asynchronously from gRPC requests. If the healthchecks fail, all requests using that connection are interrupted.

This will help in cases where a store-gateway becomes unresponsive for a prolonged period of time. The querier will more quickly remove the connection to that store-gateway and retry on another store-gateway.

It does the same for the ruler-ruler and alertmanager-alertmanager clients for good measure.

Checklist

  • NA Tests updated
  • NA Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany (Member) commented Oct 10, 2022

> The querier will more quickly remove the connection to that store-gateway. So fewer new queries will go to it and existing queries will fail faster.

The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

@osg-grafana (Contributor) commented Oct 10, 2022

Leaving technical approval to an engineer. Approved from a docs perspective to get this off my review list. :)

@dimitarvdimitrov (Contributor, Author)

> > The querier will more quickly remove the connection to that store-gateway. So fewer new queries will go to it and existing queries will fail faster.
>
> The querier uses the store-gateway ring to find which store-gateways should be queried. If a gRPC connection doesn't exist in the pool, the querier will try to create one (all of that happens in this function). I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

Yeah, that's a good point. It will only be forced to reestablish a new connection. Unfortunately, dialing a new connection is not bounded by a timeout. We can use DialContext with a context with an expiration and also use WithBlock() here:

conn, err := grpc.Dial(addr, opts...)
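
For illustration, roughly what that might look like (a sketch only; the 5-second timeout is arbitrary, and addr and opts stand for the same variables as in the snippet above):

// Sketch: bound connection establishment by combining WithBlock() with a
// context deadline. DialContext then blocks until the connection is ready
// or the context expires. Assumes the "context" and "time" packages are imported.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

conn, err := grpc.DialContext(ctx, addr, append(opts, grpc.WithBlock())...)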

WDYT @pstibrany ?

@pstibrany (Member)

> We can use DialContext with a context with an expiration and also use WithBlock() here:

This sounds like a good idea, but it will require some changes to the client pool. We already get a similar benefit from using a time-bound context for individual calls (if the connection doesn't exist yet, it is created first during a call to e.g. Series or other methods on the StoreGatewayClient interface).

@dimitarvdimitrov (Contributor, Author)

> This sounds like a good idea, but will require some changes to the client pool.

The client factory is within this package, so there's no need to change dskit if that's what you meant. I pushed a change in 9847f9d

@pstibrany (Member)

Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the caller of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.

Now blocksStoreReplicationSet.GetClientsFor will try to establish a connection to the store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still get a response.

@dimitarvdimitrov (Contributor, Author)

> Previously the blocksStoreReplicationSet.GetClientsFor function would return fast, returning a map of clients to ULIDs. But not all store-gateways will necessarily be called by the caller of this function -- store-gateways are queried randomly, so a user may be lucky and not hit the "bad" store-gateway.
>
> Now blocksStoreReplicationSet.GetClientsFor will try to establish a connection to the store-gateways first. By doing that, all users will suffer if a connection to a store-gateway cannot be established in time, while previously some users might have been lucky and still get a response.

This sounds like we sacrifice the redundancy of store-gateways.

So we added WithBlock and DialContext with a timeout context so that we don't keep trying to reestablish new connections after the healthcheck fails. How about we continue using DialContext but remove WithBlock? This way we will effectively retry establishing a connection to a failed store-gateway every $timeout seconds1 until the store-gateway also fails its ring healthcheck.

Footnotes

  1. Assuming a continuous flow of queries

@pstibrany (Member)

> How about we continue using DialContext but remove WithBlock?

This combination doesn't have the desired effect. From the DialContext documentation:

// In the non-blocking case, the ctx does not act against the connection. It
// only controls the setup steps.

We need to configure a timeout on the context that will be used to create the connection, which for non-blocking clients is the context of the first call to the client's methods.
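
To make that concrete, here is a minimal standalone sketch (not Mimir code; the target address and the 5-second timeout are made up) showing that with a plain non-blocking Dial, the per-call context is what ends up bounding connection setup:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Non-blocking dial: this returns immediately without opening the
	// connection, so a context passed to DialContext here would only bound
	// the setup steps, not the connection attempt itself.
	conn, err := grpc.Dial("store-gateway.example:9095",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The connection is established lazily on the first RPC, so this per-call
	// context is what actually bounds connection setup.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if _, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
		log.Printf("call (including lazy connection setup) failed: %v", err)
	}
}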

@dimitarvdimitrov (Contributor, Author) commented Oct 11, 2022

I couldn't find any other way to bound the connection time with the gRPC libs.

We could override the behaviour of storeGatewayClient to take care of dialing with WithBlock and a context on the first call to the remote, instead of in dialStoreGatewayClient.


I'm trying to think about what we're trying to solve with these dial timeouts. I went back to your previous comment, @pstibrany:

> I don't quite see how closing connections faster will change the querier's logic for finding which store-gateways to query.

It will fail requests to the store-gateways faster, which means they will be retried faster - the querier excludes store-gateways it has already tried to query from retries. So instead of waiting 30s and then trying again, it will wait 5s and then try again; 3 tries take a total of 15s, not 1m30s (on average).

If all of this is correct, then why do we need to bound the dial time? The client pool closes and removes a connection after it exceeds its healthcheck timeout. It will do the same on a yet uninitialized connection. So this will also break any outstanding calls using the connection (like a hung Series call).
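
For context, the pool behaviour referenced here is configured roughly like this (a sketch assuming dskit's ring/client package; the values are illustrative, not necessarily the ones changed in this PR):

// Assumes: import "time" and client "github.com/grafana/dskit/ring/client".
poolCfg := client.PoolConfig{
	// How often the pool healthchecks each cached client connection.
	CheckInterval: 10 * time.Second,
	// If a check fails or exceeds this timeout, the pool closes and removes the
	// connection, interrupting any in-flight calls on it.
	HealthCheckEnabled: true,
	HealthCheckTimeout: 10 * time.Second,
}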

@dimitarvdimitrov (Contributor, Author)

@pstibrany @pracucci can you take another look please?

@osg-grafana requested review from a team and removed the request for osg-grafana on November 16, 2022 14:46
@osg-grafana (Contributor) left a comment:

Doc-related content is fine.

@dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 58c88e8 to 01c13b0 on November 21, 2022 09:19
@dimitarvdimitrov (Contributor, Author)

@pstibrany, @pracucci can you take another look? I changed the PR to only have the gRPC client timeouts, as we discussed IRL.

We also spoke about reducing the heartbeat timeout in the ring. I will try this out internally first before changing the defaults.

@pstibrany (Member) left a comment:

Thank you!

@pracucci (Collaborator) left a comment:

LGTM, but can you do it in the alertmanager too (pkg/alertmanager/alertmanager_client.go), please?

@dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 01c13b0 to 4c0db15 on November 21, 2022 11:11
@dimitarvdimitrov force-pushed the dimitar/increase-frequency-grpc-client-pool-helathcheck branch from 4c0db15 to 1521d01 on November 21, 2022 11:15
@dimitarvdimitrov (Contributor, Author)

Done, also added a changelog entry

@pracucci enabled auto-merge (squash) on November 21, 2022 11:16
@dimitarvdimitrov changed the title from "Ruler, Querier: Increase frequency of gRPC healthchecks" to "Ruler, Querier, AM: Increase frequency of gRPC healthchecks" on Nov 21, 2022
@pracucci merged commit e4c7d3c into main on Nov 21, 2022
@pracucci deleted the dimitar/increase-frequency-grpc-client-pool-helathcheck branch on November 21, 2022 11:29
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022