-
TBH, I don't really follow what you are doing and mainly why. Maybe you should describe it step by step, including the YAMLs and the corresponding logs. But still ... why would you change the scheduling every time some chart version changes? The chart is not part of the Kafka deployment. Neither Strimzi nor Kafka cares about the Helm chart and its version, so locking it into the deployment makes no sense to me. With things like Kafka, you want stability. That means not rolling them because of some completely irrelevant changes.

The log you shared from the operator suggests that the operator tries to roll the ZooKeeper pods (that is expected, as you changed the labels and the topology spread). So what you need to do is look at why the ZooKeeper pod is not getting ready. Does it start? Is it scheduled? Or what exactly is causing it to not be ready? That is not clear from what you provided, and it is clearly what blocks the operator.
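To illustrate the "is it scheduled?" check: when a pod cannot be placed because its topology spread constraint is unsatisfiable, that normally shows up directly in the pod's own status conditions. A rough sketch of what the status could look like in that case (the node counts and message wording are only an assumption here, the exact text depends on the cluster and scheduler):

```yaml
# Illustrative only: typical status of a pod the scheduler cannot place.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: >-
        0/6 nodes are available: 6 node(s) didn't match pod topology spread constraints.
```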
I don't understand what this means. What is
That is hard to comment on without seeing the actual YAMLs and the logs. Two more things I noticed:
-
@scholzj we are actually seeing this behavior significantly more since we stopped adding the
-
The fix for issue #8528, included in the 0.35.1 update, has resolved the issue. Thanks for the help @scholzj
-
We've configured topologySpreadConstraints in an effort to keep our setup balanced and spread out over multiple nodes. Part of the constraint is a label that contains the version of our deployment, which increments on each update.
What we expect to happen
ZooKeeper, Kafka, and Cruise Control should get the new version label and be scheduled across different nodes.
What happens
ZooKeeper goes offline (the pods are killed) and the pods are not recreated. The Strimzi operator keeps trying to reconcile the pods but times out. This is only resolved by manually restarting the operator deployment, so that the operator gets a new 'strimzi revision' which triggers a successful reconciliation round.
The ZooKeeper StrimziPodSet has a status that reports one ready pod even when no pods are present.
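For reference, a minimal sketch of the shape of that status, assuming the usual pods/readyPods/currentPods counters on the StrimziPodSet resource; the values here are hypothetical:

```yaml
# Hypothetical illustration of the mismatch described above:
# the status still counts a ready pod while no pod objects exist.
apiVersion: core.strimzi.io/v1beta2
kind: StrimziPodSet
metadata:
  name: my-cluster-zookeeper
status:
  observedGeneration: 2
  pods: 1
  readyPods: 1
  currentPods: 1
```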
Configuration
We wrap external charts in a chart of our own. We also cache all images in a registry of our own.
Our operator setup is pretty much the default setup:
Our zookeeper setup:
The chartVersion can change from e.g. '1.7.0-master-44644' to '1.7.0-master-44700'.
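For context, a minimal sketch of what such a constraint could look like when set through the Kafka CR's pod template. The chartVersion label name and the values are only illustrative, not the literal manifest; the point is that the version is both a pod label and part of the constraint's selector, so every version bump changes which pods the constraint counts:

```yaml
# Abbreviated sketch: the release version appears as a pod label and in the
# spread constraint's labelSelector, so a new version effectively creates a
# new, empty "spread group" for the scheduler.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  zookeeper:
    replicas: 3
    template:
      pod:
        metadata:
          labels:
            chartVersion: 1.7.0-master-44644   # bumped on every update
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                chartVersion: 1.7.0-master-44644
```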
Logs
Operator:
Question
Is this expected behavior? What can we do to prevent the faulty state from happening on each update?
This happens with both 1 and 3 replicas of ZooKeeper. When using 3 replicas, only 2 pods are killed and the remaining one is stuck trying to find the missing instances.
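One direction for the prevention question, sketched under the assumption that the cluster pods carry the standard strimzi.io/name label that Strimzi manages itself: spread on a label that stays stable across updates instead of the per-release chartVersion label, so a version bump never changes the scheduling requirements. A hedged sketch of the same pod template:

```yaml
# Sketch only: the constraint keys on a Strimzi-managed label that does not
# change between releases, so rolling updates spread old and new pods together.
spec:
  zookeeper:
    template:
      pod:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                strimzi.io/name: my-cluster-zookeeper
```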
ZooKeeper logs: