Generated Hashring to contain only the statefulset replicas in the Ready status #75
Conversation
…ulset does not result in intended # of replicas
a001dcb to c35a38d
@metalmatze can I get a review for this PR? Thanks!
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
for i := 0; i < int(*sts.Spec.Replicas); i++ {
This seems like the right place to use a WaitGroup (https://gobyexample.com/waitgroups) with a global timeout, so that we wait a maximum of t seconds and not n*t?
Makes sense to parallelize and reduce the wait time. The slice endpoints would need to be shared and updated across the goroutines; going through a channel would have an impact on the level of parallelism.
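For illustration only, a minimal sketch (not this controller's code) of the WaitGroup idea discussed above: wait on all replicas concurrently under one global timeout and collect the ready pod names through a buffered channel, so the endpoints slice itself is only appended to in one goroutine. checkPodReady, the pod-name prefix, and the one-minute timeout are assumptions.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// checkPodReady is a stand-in for the real per-pod readiness check; here it just
// pretends every pod is ready unless the shared context has already timed out.
func checkPodReady(ctx context.Context, podName string) bool {
	select {
	case <-ctx.Done():
		return false
	default:
		return true
	}
}

func main() {
	const replicas = 3

	// One global timeout for all pods, so the total wait is t rather than n*t.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	var wg sync.WaitGroup
	readyCh := make(chan string, replicas) // buffered so senders never block

	for i := 0; i < replicas; i++ {
		podName := fmt.Sprintf("thanos-receive-%d", i)
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			if checkPodReady(ctx, name) {
				readyCh <- name
			}
		}(podName)
	}

	wg.Wait()
	close(readyCh)

	// endpoints is only appended to here, after all workers have finished.
	var endpoints []string
	for name := range readyCh {
		endpoints = append(endpoints, name)
	}
	fmt.Println(endpoints)
}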
What is the consensus on this? Are we going to use the WaitGroups?
I would vote for not using WaitGroups in this case.
- Employing WaitGroups reduces the overall wait time (for pods to come up), but requires synchronizing the append()s to a shared data structure across goroutines.
- The 1 min. is the maximum time that receive-controller waits for a pod to come up, i.e. if a pod comes up earlier than 1 min., the full 1 min. wait does not occur.
- With WaitGroups, if one of the pods does take the full 1 min. to come up, that would negate any advantage of employing WaitGroups.
I'm still not 100% sure if this is really needed. 🤔
This fix covers a specific case: updating the hashring if a scale-up does not reach the desired # of replicas.
I'm happy to merge and test this on a real-life system if nobody has major objections.
I would love to test this out locally first within e2e. @ianbillett could you take a look (not write a test, just check whether this makes sense)?
Given the logic here is pretty simple, e2e testing with envtest or kind feels too heavyweight. The fake client testing is appropriate for the complexity of this controller 👍
IMO we're only really testing static cases here; there is no definition of the behaviour when the statefulset changes status.
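As an aside, a minimal sketch of what fake-client testing looks like, assuming client-go's fake package; the object names and field values are illustrative and not taken from this controller's test suite.

package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func main() {
	replicas := int32(2)
	// Static fixtures: a statefulset that wants 2 replicas, but only pod 0 exists and is Ready.
	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "thanos-receive", Namespace: "default"},
		Spec:       appsv1.StatefulSetSpec{Replicas: &replicas},
	}
	pod0 := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "thanos-receive-0", Namespace: "default"},
		Status: corev1.PodStatus{
			Phase:      corev1.PodRunning,
			Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionTrue}},
		},
	}

	// No real cluster: the fake clientset serves the seeded objects to the code under test.
	client := fake.NewSimpleClientset(sts, pod0)

	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("spec wants %d replicas, fake cluster has %d pod(s)\n", *sts.Spec.Replicas, len(pods.Items))
}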
main.go (Outdated)
start := time.Now()
podName := fmt.Sprintf("%s-%d", sts.Name, i)
var endpoints []string
// Iterate over new replicas to wait until they are running
I know you didn't write it initially - but this comment currently does not match the implementation:
- We iterate over all replicas - not just new ones.
- We don't wait for them to be running, we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.
I know you didn't write it initially - but this comment currently does not match the implementation:
- We iterate over all replicas - not just new ones.

That's right. Updated the comment to clarify that we iterate through all the replicas.

- We don't wait for them to be running, we wait a maximum of 1 minute for each pod to become ready, and skip adding them to the hashring if they are not ready.

waitForPod() is a no-op for replicas already in 'Running' status and waits for 1 min. for non-ready replicas to get to 'Ready' status. The non-ready replicas (after 1 min.) are not added to the hashring.
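To make that behaviour concrete, here is a minimal sketch of such a wait under assumed names (podIsReady, waitForPodReady, the controller package name, and a 5s poll interval); it is not the controller's actual waitForPod implementation.

package controller // assumed package name, for illustration

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podIsReady reports whether the pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// waitForPodReady returns true if the named pod is (or becomes) Ready within timeout.
// Callers that get false leave the pod's endpoint out of the generated hashring.
func waitForPodReady(client kubernetes.Interface, namespace, name string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err == nil && podIsReady(pod) {
			return true // already Ready: effectively a no-op wait
		}
		if time.Now().After(deadline) {
			return false // skip this replica for now
		}
		time.Sleep(5 * time.Second) // assumed poll interval
	}
}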
Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.
For example:
- If we have 7 replicas
- the 7th is suddenly crashing for some reason
- the controller changes the hashring to have only 6 replicas
- this increases the load on the remaining 6 replicas, so e.g. the 6th starts crashing too
- now the controller changes to 5 replicas, which brings even more load to the remaining 5
- the whole system is on fire for all tenants.
WDYT? 🤔
I think we should wait with the replica-number increase on the hashring until the first pod is ready, yes, but later unreadiness should be assumed to be a crash, no?
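(To put rough numbers on that cascade, assuming the ingest load L is spread evenly: with 7 ready replicas each receiver handles L/7; dropping the crashed 7th from the hashring raises each of the remaining 6 to L/6, about 17% more than before, and a further drop to 5 means L/5, 40% above the original per-replica load.)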
This scenario (one of the replicas in Ready status going down == not able to serve traffic) is a possible case; limiting the incoming traffic (based on the replica count in Ready status) can be one solution. The role of the hashring-controller (or the plain Kube StatefulSet controller) is to contain / send traffic to the replica addresses which can serve traffic.
To sum up our offline discussion on our sync:
Next steps:
Added ticket: thanos-io/thanos#4425
/hold as commented above.
I still believe there is a difference between adding a new replica and restarting an existing replica. For 1000 tenants, we would possibly start creating 1000 new TSDBs across receivers for every restart of receive, even a short one.
Also, this attempt discussed on Slack states that such an experiment (similar to this PR) failed due to larger memory consumption of the receivers.
Still, I will be experimenting with different ideas here. My plan is to also create local test scenarios using e2e to mimic the controller and failure cases.
An early attempt is already not working as expected: thanos-io/thanos#4968
There are two parts to the problem of ingestion failures:
This PR fixes (1) from the above list.
@spaparaju can we close this PR now that we have #80?
Yes, Ian. Opened a new PR (#80) for fixing this bug. Closing this PR.
Currently, when the statefulset is scaled up (from a lower number to a higher number), pods are checked to be in Ready condition before the respective endpoints are considered part of the hashring ConfigMap. However, in subsequent runs of the thanos-receive-controller loop, the statefulset's full spec replicas (the intended # of replicas) is used for hashring generation. This can be a problem if the scale-up fails to reach the intended # of replicas.
This PR :
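As an illustration only (assumed names, not this PR's actual diff), a minimal sketch of the readiness-gated endpoint generation described above: walk every replica the statefulset intends to run, but keep only the Ready ones in the hashring.

package main

import (
	"fmt"
)

// replicaReady is a stand-in for the controller's per-pod readiness check.
type replicaReady func(podName string) bool

// generateEndpoints iterates over all intended replicas but only keeps the ones
// whose pod is Ready, so a failed scale-up never inflates the hashring.
func generateEndpoints(stsName string, specReplicas int, port int, ready replicaReady) []string {
	var endpoints []string
	for i := 0; i < specReplicas; i++ {
		podName := fmt.Sprintf("%s-%d", stsName, i)
		if !ready(podName) {
			continue // non-ready replica: leave it out of the hashring
		}
		endpoints = append(endpoints, fmt.Sprintf("%s:%d", podName, port))
	}
	return endpoints
}

func main() {
	// Pretend the scale-up to 3 replicas only brought up 2 Ready pods.
	readyPods := map[string]bool{"thanos-receive-0": true, "thanos-receive-1": true}
	eps := generateEndpoints("thanos-receive", 3, 10901, func(name string) bool { return readyPods[name] })
	fmt.Println(eps) // [thanos-receive-0:10901 thanos-receive-1:10901]
}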