
Generated Hashring to contain only the statefulset replicas in the Ready status #75

Closed
spaparaju wants to merge 4 commits

Conversation


@spaparaju commented Apr 21, 2021

Currently, when the statefulset is scaled up (from a lower to a higher number of replicas), pods are checked for the Ready condition before their endpoints are added to the Hashring ConfigMap. However, in subsequent runs of the thanos-receive-controller loop, the full statefulset spec.replicas (the intended # of replicas) is used for Hashring generation. This is a problem if the scale-up fails to reach the intended # of replicas.

This PR:

  • ensures that a scale-up of replicas (even one that does not reach the intended # of replicas) results in a generated Hashring that contains only the endpoints of replicas in the Ready status (see the sketch below)
  • removes the separate scale-up logic, so that every thanos-receive-controller run checks that pods are in the Ready status before their service endpoints are added to the Hashring
  • updates the tests to cover the 'check that pods are in the Ready status' path
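A minimal sketch of the readiness filter the description is aiming for, purely for illustration; the helper names (readyEndpoints, podIsReady) and the endpoint format are assumptions, not the controller's actual code:

```go
package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyEndpoints walks the StatefulSet's desired replicas and keeps only the
// pods whose PodReady condition is True, so a partially failed scale-up does
// not leak not-ready endpoints into the hashring.
func readyEndpoints(ctx context.Context, client kubernetes.Interface, sts *appsv1.StatefulSet, port int) []string {
	var endpoints []string
	for i := 0; i < int(*sts.Spec.Replicas); i++ {
		podName := fmt.Sprintf("%s-%d", sts.Name, i)
		pod, err := client.CoreV1().Pods(sts.Namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			continue // pod does not exist yet (e.g. scale-up stalled); skip it
		}
		if podIsReady(pod) {
			endpoints = append(endpoints,
				fmt.Sprintf("%s.%s.%s.svc:%d", podName, sts.Spec.ServiceName, sts.Namespace, port))
		}
	}
	return endpoints
}

// podIsReady reports whether the pod's PodReady condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```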

…ulset does not result in intended # of replicas
@spaparaju (Author)

cc @metalmatze @squat

@spaparaju (Author)

@metalmatze can I get a review for this PR? Thanks!

@spaparaju changed the title from "Generated Hashring to contain only thre statefulset replicas in the Ready status" to "Generated Hashring to contain only the statefulset replicas in the Ready status" Apr 29, 2021
```go
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
for i := 0; i < int(*sts.Spec.Replicas); i++ {
```
Member

This seems like the right place to use a WaitGroup (https://gobyexample.com/waitgroups) together with a global timeout, so that we wait a maximum of t seconds and not n*t?
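A rough sketch of what this suggestion could look like, purely for discussion; the function and callback names are made up and the real controller code may differ:

```go
package controller

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// readyEndpointsParallel checks all replicas concurrently and bounds the whole
// wait with a single context timeout, so the worst case is ~1 minute total
// rather than 1 minute per pod. isReady stands in for the per-pod readiness check.
func readyEndpointsParallel(replicas int, stsName string, isReady func(ctx context.Context, podName string) bool) []string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		endpoints []string
	)
	for i := 0; i < replicas; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			podName := fmt.Sprintf("%s-%d", stsName, i)
			if isReady(ctx, podName) {
				mu.Lock() // the shared slice needs synchronization, as discussed below
				endpoints = append(endpoints, podName)
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return endpoints
}
```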

Author

Makes sense to parallelize and reduce the wait time. The slice 'endpoints' would need to be shared and updated across the goroutines; funnelling those updates through a channel would have an impact on the level of parallelism.

Member

What is the consensus on this? Are we going to use the WaitGroups?

Author

I would vote for not using WaitGroups in this case.

  • Employing WaitGroups reduces the overall wait time (for pods to come up), but requires synchronizing the append()s to a shared data structure across goroutines.
  • The 1 min. is the maximum time the receive-controller waits for a pod to come up, i.e. if a pod comes up earlier than 1 min., the full 1 min. wait does not occur.
  • With WaitGroups, if one of the pods does take the full 1 min. to come up, that would negate any advantage of employing WaitGroups.

@metalmatze (Contributor)

I'm still not 100% sure if this is really needed. 🤔
The scale-up checks that the individual pods are ready before continuing with the scale-up. So at the end of the iteration the pods are all ready?
What exactly am I missing?

@spaparaju (Author)

This fix covers a specific case: updating the hash-ring when a scale-up does not reach the desired # of replicas.
Without this fix, successive runs of the operator's sync() loop always update the hash-ring with the desired # of replicas (spec.replicas), even if those replicas are not ready.

@kakkoyun (Member) left a comment

I'm happy to merge and test this on a real-life system if nobody has major objections.

@bwplotka (Member) commented Jun 24, 2021

I would love to test this out locally first within e2e. @ianbillett could you take a look (not write tests, just check whether this makes sense)?

@bill3tt left a comment

Given that the logic here is pretty simple, e2e testing with envtest or kind feels too heavyweight. Fake-client testing is appropriate for the complexity of this controller 👍

IMO we're only really testing static cases here; there is no definition of the behaviour when the statefulset changes status.
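For context, a minimal sketch of the fake-client style of test being referred to; this is illustrative only, and the pod name, namespace, and assertions are made up rather than taken from the PR's actual tests:

```go
package controller

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestNotReadyPodIsVisibleToController(t *testing.T) {
	// thanos-receive-default-0 exists but is not Ready; a controller built on
	// this clientset should therefore leave it out of the generated hashring.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "thanos-receive-default-0", Namespace: "thanos"},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionFalse}},
		},
	}
	client := fake.NewSimpleClientset(pod)

	got, err := client.CoreV1().Pods("thanos").Get(context.Background(), "thanos-receive-default-0", metav1.GetOptions{})
	if err != nil {
		t.Fatal(err)
	}
	for _, c := range got.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			t.Fatalf("pod unexpectedly reported Ready")
		}
	}
}
```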

main.go (outdated)

```go
start := time.Now()
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
```
@bill3tt Jun 28, 2021

I know you didn't write it initially, but this comment currently does not match the implementation:

  1. We iterate over all replicas - not just new ones.
  2. We don't wait for them to be running; we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.

Author

I know you didn't write it initially, but this comment currently does not match the implementation:

  1. We iterate over all replicas - not just new ones.
    That's right. I have updated the comment to clarify that we iterate through all the replicas.
  2. We don't wait for them to be running; we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.
    waitForPod() is a no-op for replicas in the 'Running' status and waits for 1 min. for non-ready replicas to reach the 'Ready' status. The non-ready replicas (after 1 min.) are not added to the Hashring (see the sketch below).
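A sketch of the per-pod wait described here, under the assumption that pod readiness is what is being checked; this is not the PR's exact waitForPod() implementation, and the polling interval and condition check are illustrative:

```go
package controller

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodReady returns immediately for a pod that is already Ready and
// gives a non-ready pod up to one minute to become Ready; if the deadline
// passes, the caller leaves that pod out of the generated hashring.
func waitForPodReady(ctx context.Context, client kubernetes.Interface, namespace, podName string) bool {
	deadline := time.Now().Add(time.Minute)
	for {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err == nil {
			for _, c := range pod.Status.Conditions {
				if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
					return true // already Ready: effectively a no-op wait
				}
			}
		}
		if time.Now().After(deadline) {
			return false // still not Ready after ~1 min: caller skips this pod
		}
		time.Sleep(5 * time.Second) // illustrative polling interval
	}
}
```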

@bwplotka (Member) left a comment

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • 7th is suddenly crashing for some reason
  • controller changing hashring to have only 6 replicas
  • it increases load on 6 replicas so e.g 6th is crashing too
  • now controller changes to 5 replicas, which brings more load to 5
  • whole system is on fire for all tenants.

WDYT? 🤔

I think we should wait to increase the replica number in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no?

@spaparaju (Author)

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • 7th is suddenly crashing for some reason
  • controller changing hashring to have only 6 replicas
  • it increases load on 6 replicas so e.g 6th is crashing too
  • now controller changes to 5 replicas, which brings more load to 5
  • whole system is on fire for all tenants.

This scenario (one of the Ready replicas going down, i.e. becoming unable to serve traffic) is a possible case; limiting the incoming traffic (based on the count of Ready replicas) could be one solution. The role of the hashring-controller (or the plain Kube SS controller) is to send traffic only to the replica addresses that can serve it.

WDYT? 🤔

I think we should wait to increase the replica number in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no?

The hash-ring would contain its first valid address only after the first SS pod is ready to serve traffic.

@bwplotka (Member) commented Jul 7, 2021

To sum up our offline discussion from our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation to the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop the cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it's very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits (see the sketch below)
  2. Merge this PR once we have done step 1
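Purely to illustrate the idea in step 1 (this is not an agreed design, and none of these names exist in the codebase): a per-receive limiter that backs off when the process is close to its resource limits, assuming golang.org/x/time/rate and a hypothetical memory-pressure signal in [0, 1]:

```go
package receive

import "golang.org/x/time/rate"

// adjustLimit lowers the accepted request rate when the receiver approaches
// its memory limit and restores it when pressure drops again. memoryPressure
// is a hypothetical signal in [0, 1] derived from the container's limits.
func adjustLimit(limiter *rate.Limiter, memoryPressure float64, baseLimit rate.Limit) {
	switch {
	case memoryPressure > 0.9: // close to the limit: back off hard
		limiter.SetLimit(baseLimit / 4)
	case memoryPressure > 0.7: // getting warm: back off gently
		limiter.SetLimit(baseLimit / 2)
	default:
		limiter.SetLimit(baseLimit)
	}
}
```

Anything real would of course need to hook into actual resource metrics and the request path; this only shows the back-off shape.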

@bwplotka (Member) commented Jul 7, 2021

Added ticket: thanos-io/thanos#4425

@bwplotka (Member) left a comment

/hold as commented above.

@spaparaju (Author)

To sum up our offline discussion from our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation to the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop the cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it's very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits
  2. Merge this PR once we have done step 1
    A backoff rate limit per receive makes sense; thanks for creating an issue to track it. Cascading failures and backoff rate limiting are independent of this PR, as cascading failures can happen even with the current master release of the receive-controller (a failed write to a not-ready pod will be retried by the client and passed on to a ready pod).

@bwplotka (Member) left a comment

I still believe there is a difference between adding a new replica and restarting an existing replica. For 1000 tenants, we would start creating possibly 1000 new TSDBs across receivers for every restart of receive, even a short one.

Also, this attempt, discussed on Slack, suggests that such an experiment (similar to this PR) failed due to higher memory consumption of the receivers.

@bwplotka (Member)

Still, I will be experimenting with different ideas here. My plan is also to create local test scenarios using e2e to mimic the controller and failure cases.

@bwplotka (Member)

An early attempt is already not working as expected: thanos-io/thanos#4968

@spaparaju (Author)

There are two parts to the problem of ingestion failures:

  1. A few replicas in the non-Ready status sneaking into the hash-ring
  2. A better, consistent hash-ring

This PR fixes (1) from the above list.

@bill3tt commented Dec 30, 2021

@spaparaju can we close this PR now that we have #80?

@spaparaju (Author)

Yes, Ian. I opened a new PR (#80) to fix this bug. Closing this PR.

@spaparaju closed this Jan 4, 2022