storage controller: be more tolerant of a pageserver's startup time #7552

jcsp · 2024-04-30T08:15:05Z

We mark a node offline after MAX_UNAVAILABLE_INTERVAL_DEFAULT (30 seconds) of failures to respond to heartbeats.

We should be more lenient during startup: when a pageserver sends us a re-attach request, we should tip off the heartbeater to allow it more time. Currently the pageserver's processing of the re-attach response can be quite time-consuming.

This is similar to the k8s distinction between a readiness check and a status check: we should be more tolerant when waiting for readiness during startup than when checking for responsiveness during normal runtime.

(The actual init_tenant_mgr slowness is addressed in #7553, but this ticket still stands: we should be more tolerant during startup than we are during normal operation.)
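A minimal sketch of the idea above, not the storage controller's actual code: every name below except MAX_UNAVAILABLE_INTERVAL_DEFAULT is hypothetical, and the 300-second startup allowance is an arbitrary illustrative value. A re-attach request switches the node into a more tolerant phase, and the first successful heartbeat switches it back to the normal 30-second limit.

```rust
use std::time::Duration;

/// 30 seconds, as stated in this issue.
const MAX_UNAVAILABLE_INTERVAL_DEFAULT: Duration = Duration::from_secs(30);

/// Hypothetical: how long a node may stay unresponsive while starting up.
/// The value is illustrative only.
const STARTUP_GRACE_INTERVAL: Duration = Duration::from_secs(300);

/// Hypothetical phases used to pick the heartbeat tolerance.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NodePhase {
    /// A re-attach request was seen and no heartbeat has succeeded since.
    WarmingUp,
    /// Normal runtime.
    Active,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Availability {
    Online,
    Offline,
}

/// Hypothetical per-node state kept by the heartbeater.
struct NodeHealth {
    phase: NodePhase,
    /// Time accumulated since the last successful heartbeat
    /// (or since the re-attach request, whichever is more recent).
    unresponsive_for: Duration,
}

impl NodeHealth {
    fn new() -> Self {
        NodeHealth { phase: NodePhase::Active, unresponsive_for: Duration::ZERO }
    }

    /// Called when the pageserver sends a re-attach request.
    fn on_reattach(&mut self) {
        self.phase = NodePhase::WarmingUp;
        self.unresponsive_for = Duration::ZERO;
    }

    /// Called when a heartbeat gets a response: startup is over.
    fn on_heartbeat_ok(&mut self) {
        self.phase = NodePhase::Active;
        self.unresponsive_for = Duration::ZERO;
    }

    /// Called when a heartbeat fails; `elapsed` is the time since the previous attempt.
    fn on_heartbeat_failed(&mut self, elapsed: Duration) {
        self.unresponsive_for += elapsed;
    }

    /// Apply the limit that matches the current phase.
    fn availability(&self) -> Availability {
        let limit = match self.phase {
            NodePhase::WarmingUp => STARTUP_GRACE_INTERVAL,
            NodePhase::Active => MAX_UNAVAILABLE_INTERVAL_DEFAULT,
        };
        if self.unresponsive_for > limit {
            Availability::Offline
        } else {
            Availability::Online
        }
    }
}
```

Keeping the phase next to the failure counter means the offline decision stays in one place, and a node that never finishes starting up is still marked offline once the longer allowance runs out.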


jcsp · 2024-04-30T08:20:48Z

@VladLazar let's pick this up as part of the rolling restart work: we should flip the node into its more tolerant mode on:

- Notification that it is draining (or a separate hook for "I'm about to shut this down"?)
- Starting to handle re-attach

VladLazar · 2024-07-08T13:19:18Z

2024-07-08

#8222 (storcon: make heartbeats restart aware) destabilised a bunch of tests.
I fixed a few, but it's slow going.

jcsp added the t/bug (Issue Type: Bug) and c/storage/controller (Component: Storage Controller) labels Apr 30, 2024

jcsp mentioned this issue Apr 30, 2024

storage controller: test for large shard counts #7475

Merged

5 tasks

VladLazar self-assigned this Jul 1, 2024

VladLazar mentioned this issue Jul 1, 2024

storcon: make heartbeats restart aware #8222

Merged

5 tasks

VladLazar closed this as completed in #8222 Jul 25, 2024

VladLazar closed this as completed in 9c5ad21 Jul 25, 2024
