
--allow-dynamic-scaling does not respond to pod disruptions #123

Open

tekicode opened this issue Oct 3, 2023 · 2 comments

Comments

@tekicode
Contributor

tekicode commented Oct 3, 2023

In the readme about --allow-dynamic-scaling:

By default, the controller does not react to voluntary/involuntary disruptions to receiver replicas in the StatefulSet. This flag allows the user to enable this behavior. When enabled, the controller will react to voluntary/involuntary disruptions to receiver replicas in the StatefulSet. When a Pod is marked for termination, the controller will remove it from the hashring and the replica essentially becomes a "router" for the hashring. When a Pod is deleted, the controller will remove it from the hashring. When a Pod becomes unready, the controller will remove it from the hashring. This behaviour can be considered for use alongside the Ketama hashing algorithm.

The two highlighted lines are incorrect: the controller does not have a podInformer subscribed to receive updates from the pods associated with the hashring.

As such, the --allow-dynamic-scaling flag only responds to changes in the StatefulSet's replica count, which only happens when the StatefulSet itself is updated; that is separate from the health of the pods.

I've explored adding a podInformer, updating the configmapInformer, and reworking the logic around how pods are chosen while keeping backwards compatibility.
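For context, a pod informer along those lines could look roughly like the sketch below. This is only a sketch under stated assumptions, not the controller's actual code: the namespace, the label selector (an existence check on controller.receive.thanos.io/hashring on the pods), and the resync wiring are placeholders, and the hashring regeneration itself is elided.

```go
// Sketch: watch receiver pods and trigger a hashring resync on pod
// deletions and on Ready/deletionTimestamp transitions.
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the receiver pods in the controller's namespace.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		5*time.Minute,
		informers.WithNamespace("thanos"), // placeholder namespace
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = "controller.receive.thanos.io/hashring" // assumed pod label
		}),
	)

	// Coalesce events into a single pending resync signal.
	resync := make(chan struct{}, 1)
	enqueue := func() {
		select {
		case resync <- struct{}{}:
		default: // a resync is already pending
		}
	}

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// A deleted pod should trigger a hashring regeneration.
		DeleteFunc: func(obj interface{}) { enqueue() },
		// So should a change in readiness or the start of termination.
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldPod, ok1 := oldObj.(*corev1.Pod)
			newPod, ok2 := newObj.(*corev1.Pod)
			if !ok1 || !ok2 {
				return
			}
			if podReady(oldPod) != podReady(newPod) ||
				(oldPod.DeletionTimestamp == nil) != (newPod.DeletionTimestamp == nil) {
				enqueue()
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	for range resync {
		// Here the controller would regenerate the hashring ConfigMap,
		// excluding pods that are terminating or unready (not shown).
	}
}

// podReady reports whether the pod's Ready condition is true.
func podReady(p *corev1.Pod) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

One possible design is for delete events and Ready/deletionTimestamp transitions to funnel into whatever resync mechanism the controller already uses for StatefulSet updates, so the existing hashring generation path stays untouched.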

However, I've seen a lot of previous discussion and issues about this and related problems. Is this seen as a problem? (It is to me.) If so, what opinions do others have on how the controller should behave in this situation?

@christopherzli
Contributor

I am also looking into this and am wondering if there has been any follow-up?

@chit786

chit786 commented Jun 4, 2024

+1 on this feature. We also encountered a situation where a receiver was still waiting for WAL replay but was not moved out of the hashring, leading to writes erroring out with:

 msg="failed to handle request" err="get appender: TSDB not ready" 

I wonder if the pod's restartPolicy could be utilised to mark it as "Terminating", and the controller could then pick up from there?
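For illustration, since restartPolicy only governs container restarts inside a pod, another signal the controller could key off is the pod's deletionTimestamp (set while terminating) together with its Ready condition, which stays false during WAL replay assuming the readiness probe targets the receiver's /-/ready endpoint. A minimal, hypothetical eligibility check along those lines (eligibleForHashring is an illustrative name, not an existing function):

```go
package hashring

import (
	corev1 "k8s.io/api/core/v1"
)

// eligibleForHashring reports whether a receiver pod should stay in the
// hashring: it must not be terminating, and its Ready condition must be
// true (which would exclude pods still replaying the WAL, assuming the
// readiness probe points at the receiver's /-/ready endpoint).
func eligibleForHashring(p *corev1.Pod) bool {
	// A non-nil deletionTimestamp means the pod is terminating.
	if p.DeletionTimestamp != nil {
		return false
	}
	for _, c := range p.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

The controller could drop any pod failing such a check from the generated hashring and re-add it once it becomes Ready again.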
