
Generated Hashring to contain only the statefulset replicas in the Ready status #75

Closed
spaparaju wants to merge 4 commits

Conversation


@spaparaju commented Apr 21, 2021

Currently, when the statefulset is scaled up (from a lower to a higher number of replicas), pods are checked for the Ready condition before their endpoints are added to the Hashring ConfigMap. However, in subsequent runs of the thanos-receive-controller loop, the full statefulset spec.replicas (the intended # of replicas) is used for Hashring generation. This is a problem if the scale-up fails to reach the intended # of replicas.

This PR:

  • ensures that a scale-up of replicas (even one that does not reach the intended # of replicas) results in a generated Hashring that contains only the endpoints of replicas in the Ready status (see the sketch below)
  • removes the separate scale-up logic, so that every thanos-receive-controller run checks that pods are in the Ready status before their service endpoints are added to the Hashring
  • updates the tests to cover the 'check that pods are in the Ready status' path
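A minimal sketch of the readiness filter the description is aiming for, purely for illustration; the helper names (readyEndpoints, podIsReady) and the endpoint format are assumptions, not the controller's actual code:

```go
package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyEndpoints walks the StatefulSet's desired replicas and keeps only the
// pods whose PodReady condition is True, so a partially failed scale-up does
// not leak not-ready endpoints into the hashring.
func readyEndpoints(ctx context.Context, client kubernetes.Interface, sts *appsv1.StatefulSet, port int) []string {
	var endpoints []string
	for i := 0; i < int(*sts.Spec.Replicas); i++ {
		podName := fmt.Sprintf("%s-%d", sts.Name, i)
		pod, err := client.CoreV1().Pods(sts.Namespace).Get(ctx, podName, metav1.GetOptions{})
		if err != nil {
			continue // pod does not exist yet (e.g. scale-up stalled); skip it
		}
		if podIsReady(pod) {
			endpoints = append(endpoints,
				fmt.Sprintf("%s.%s.%s.svc:%d", podName, sts.Spec.ServiceName, sts.Namespace, port))
		}
	}
	return endpoints
}

// podIsReady reports whether the pod's PodReady condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```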

…ulset does not result in intended # of replicas
@spaparaju (Author)

cc @metalmatze @squat

@spaparaju (Author)

@metalmatze can I get a review for this PR? Thanks!

@spaparaju changed the title from "Generated Hashring to contain only thre statefulset replicas in the Ready status" to "Generated Hashring to contain only the statefulset replicas in the Ready status" Apr 29, 2021
```go
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
for i := 0; i < int(*sts.Spec.Replicas); i++ {
```
Member

This seems like the right place to use a WaitGroup (https://gobyexample.com/waitgroups) together with a global timeout, so that we wait a maximum of t seconds and not n*t?
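A rough sketch of what this suggestion could look like, purely for discussion; the function and callback names are made up and the real controller code may differ:

```go
package controller

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// readyEndpointsParallel checks all replicas concurrently and bounds the whole
// wait with a single context timeout, so the worst case is ~1 minute total
// rather than 1 minute per pod. isReady stands in for the per-pod readiness check.
func readyEndpointsParallel(replicas int, stsName string, isReady func(ctx context.Context, podName string) bool) []string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		endpoints []string
	)
	for i := 0; i < replicas; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			podName := fmt.Sprintf("%s-%d", stsName, i)
			if isReady(ctx, podName) {
				mu.Lock() // the shared slice needs synchronization, as discussed below
				endpoints = append(endpoints, podName)
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return endpoints
}
```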

Author

Makes sense to parallelize and reduce the wait time. The slice 'endpoints' would need to be shared and updated across the goroutines; funnelling those updates through a channel would have an impact on the level of parallelism.

Member

What is the consensus on this? Are we going to use the WaitGroups?

Author

I would vote for not using WaitGroups in this case.

  • Employing WaitGroups reduces the overall wait time (for pods to come up), but requires synchronizing the append()s to a shared data structure across goroutines.
  • The 1 min. is the maximum time the receive-controller waits for a pod to come up, i.e. if a pod comes up earlier than 1 min., the full 1 min. wait does not occur.
  • With WaitGroups, if one of the pods does take the full 1 min. to come up, that would negate any advantage of employing WaitGroups.

@metalmatze (Contributor)

I'm still not 100% sure if this is really needed. 🤔
The scale-up checks that the individual pods are ready before continuing with the scale-up. So at the end of the iteration the pods are all ready?
What exactly am I missing?

@spaparaju (Author)

This fix covers a specific case: updating the hash-ring when a scale-up does not reach the desired # of replicas.
Without this fix, successive runs of the operator's sync() loop always update the hash-ring with the desired # of replicas (spec.replicas), even if those replicas are not ready.

@kakkoyun (Member) left a comment

I'm happy to merge and test this on a real-life system if nobody has major objections.

@bwplotka (Member) commented Jun 24, 2021

I would love to test this out locally first within e2e. @ianbillett could you take a look (not write tests, just check whether this makes sense)?

@bill3tt left a comment

Given that the logic here is pretty simple, e2e testing with envtest or kind feels too heavyweight. Fake-client testing is appropriate for the complexity of this controller 👍

IMO we're only really testing static cases here; there is no definition of the behaviour when the statefulset changes status.
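For context, a minimal sketch of the fake-client style of test being referred to; this is illustrative only, and the pod name, namespace, and assertions are made up rather than taken from the PR's actual tests:

```go
package controller

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestNotReadyPodIsVisibleToController(t *testing.T) {
	// thanos-receive-default-0 exists but is not Ready; a controller built on
	// this clientset should therefore leave it out of the generated hashring.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "thanos-receive-default-0", Namespace: "thanos"},
		Status: corev1.PodStatus{
			Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionFalse}},
		},
	}
	client := fake.NewSimpleClientset(pod)

	got, err := client.CoreV1().Pods("thanos").Get(context.Background(), "thanos-receive-default-0", metav1.GetOptions{})
	if err != nil {
		t.Fatal(err)
	}
	for _, c := range got.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			t.Fatalf("pod unexpectedly reported Ready")
		}
	}
}
```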

main.go (outdated)

```go
start := time.Now()
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
```
@bill3tt Jun 28, 2021

I know you didn't write it initially, but this comment currently does not match the implementation:

  1. We iterate over all replicas - not just new ones.
  2. We don't wait for them to be running; we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.

Author

I know you didn't write it initially, but this comment currently does not match the implementation:

  1. We iterate over all replicas - not just new ones.
    That's right. I have updated the comment to clarify that we iterate through all the replicas.
  2. We don't wait for them to be running; we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.
    waitForPod() is a no-op for replicas in the 'Running' status and waits for 1 min. for non-ready replicas to reach the 'Ready' status. The non-ready replicas (after 1 min.) are not added to the Hashring (see the sketch below).
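A sketch of the per-pod wait described here, under the assumption that pod readiness is what is being checked; this is not the PR's exact waitForPod() implementation, and the polling interval and condition check are illustrative:

```go
package controller

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForPodReady returns immediately for a pod that is already Ready and
// gives a non-ready pod up to one minute to become Ready; if the deadline
// passes, the caller leaves that pod out of the generated hashring.
func waitForPodReady(ctx context.Context, client kubernetes.Interface, namespace, podName string) bool {
	deadline := time.Now().Add(time.Minute)
	for {
		pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
		if err == nil {
			for _, c := range pod.Status.Conditions {
				if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
					return true // already Ready: effectively a no-op wait
				}
			}
		}
		if time.Now().After(deadline) {
			return false // still not Ready after ~1 min: caller skips this pod
		}
		time.Sleep(5 * time.Second) // illustrative polling interval
	}
}
```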

@bwplotka (Member) left a comment

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • 7th is suddenly crashing for some reason
  • controller changing hashring to have only 6 replicas
  • it increases load on 6 replicas so e.g 6th is crashing too
  • now controller changes to 5 replicas, which brings more load to 5
  • whole system is on fire for all tenants.

WDYT? 🤔

I think we should wait to increase the replica number in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no?

@spaparaju (Author)

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • 7th is suddenly crashing for some reason
  • controller changing hashring to have only 6 replicas
  • it increases load on 6 replicas so e.g 6th is crashing too
  • now controller changes to 5 replicas, which brings more load to 5
  • whole system is on fire for all tenants.

This scenario (one of the Ready replicas going down, i.e. becoming unable to serve traffic) is a possible case; limiting the incoming traffic (based on the count of Ready replicas) could be one solution. The role of the hashring-controller (or the plain Kube SS controller) is to send traffic only to the replica addresses that can serve it.

WDYT? 🤔

I think we should wait to increase the replica number in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no?

The hash-ring would contain its first valid address only after the first SS pod is ready to serve traffic.

@bwplotka (Member) commented Jul 7, 2021

To sum up our offline discussion from our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation to the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop the cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it's very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits (see the sketch below)
  2. Merge this PR once we have done step 1
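Purely to illustrate the idea in step 1 (this is not an agreed design, and none of these names exist in the codebase): a per-receive limiter that backs off when the process is close to its resource limits, assuming golang.org/x/time/rate and a hypothetical memory-pressure signal in [0, 1]:

```go
package receive

import "golang.org/x/time/rate"

// adjustLimit lowers the accepted request rate when the receiver approaches
// its memory limit and restores it when pressure drops again. memoryPressure
// is a hypothetical signal in [0, 1] derived from the container's limits.
func adjustLimit(limiter *rate.Limiter, memoryPressure float64, baseLimit rate.Limit) {
	switch {
	case memoryPressure > 0.9: // close to the limit: back off hard
		limiter.SetLimit(baseLimit / 4)
	case memoryPressure > 0.7: // getting warm: back off gently
		limiter.SetLimit(baseLimit / 2)
	default:
		limiter.SetLimit(baseLimit)
	}
}
```

Anything real would of course need to hook into actual resource metrics and the request path; this only shows the back-off shape.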

@bwplotka (Member) commented Jul 7, 2021

Added ticket: thanos-io/thanos#4425

@bwplotka (Member) left a comment

/hold as commented above.

@spaparaju (Author)

To sum up our offline discussion from our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation to the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop the cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it's very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits
  2. Merge this PR once we have done step 1
    A backoff rate limit per receive makes sense; thanks for creating an issue to track it. Cascading failures and backoff rate limiting are independent of this PR, as cascading failures can happen even with the current master release of the receive-controller (a failed write to a not-ready pod will be retried by the client and passed on to a ready pod).

@bwplotka (Member) left a comment

I still believe there is a difference between adding a new replica and restarting an existing replica. For 1000 tenants, we would start creating possibly 1000 new TSDBs across receivers for every restart of receive, even a short one.

Also, this attempt, discussed on Slack, suggests that such an experiment (similar to this PR) failed due to higher memory consumption of the receivers.

@bwplotka (Member)

Still, I will be experimenting with different ideas here. My plan is also to create local test scenarios using e2e to mimic the controller and failure cases.

@bwplotka (Member)

An early attempt is already not working as expected: thanos-io/thanos#4968

@spaparaju (Author)

There are two parts to the problem of ingestion failures:

  1. A few replicas in the non-Ready status sneaking into the hash-ring
  2. A better, consistent hash-ring

This PR fixes (1) from the above list.

@bill3tt commented Dec 30, 2021

@spaparaju can we close this PR now that we have #80?

@spaparaju (Author)

Yes, Ian. I opened a new PR (#80) to fix this bug. Closing this PR.

@spaparaju closed this Jan 4, 2022