Description
We find that zookeeper-operator could accidentally delete a zookeeper pod and its PVC when the controller reads stale information about the ZookeeperCluster.
More concretely, we observe that if a ZookeeperCluster goes through a scale-down and then a scale-up, and one of the apiservers becomes slow (or partitioned from the others), the (restarted) controller may read stale status/spec information for the ZookeeperCluster. The stale information can trigger PVC deletion in reconcileFinalizers and a statefulset scale-down (zookeeper pod deletion) in reconcileStatefulSet.
We list the concrete reproduction steps below:
1. Run the controller in an HA k8s cluster and create a ZookeeperCluster zkc with replicas=2. The controller talks to apiserver1 (which is not stale). There will be two zookeeper pods, zk1 and zk2, and two PVCs in the cluster.
2. Scale zkc down (by setting replicas=1). After the cluster stabilizes, there will be only one zookeeper pod, zk1, and one PVC. Meanwhile, apiserver2 gets partitioned, so its watch cache freezes at the moment when zkc has replicas=1.
3. Scale zkc up (by setting replicas=2). Now a new zk2 and its PVC come back.
4. After experiencing a crash, the restarted controller talks to the stale apiserver2. From apiserver2's watch cache, the controller finds that zkc's Status.ReadyReplicas and Spec.Replicas are both 1, which is lower than the PVC count (i.e., 2). Inside cleanupOrphanPVCs the controller will treat the PVC of zk2 as an orphan PVC and delete it. Later, in Updating StatefulSet, the controller will also set the Replicas of the statefulset back to 1, which will trigger deletion of zk2 (see the sketch right after these steps).
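To make the failure concrete, here is a simplified, self-contained illustration of why a stale Spec.Replicas makes zk2's PVC look orphaned. This is not the operator's actual cleanupOrphanPVCs code; the PVC naming scheme (data-&lt;cluster&gt;-&lt;ordinal&gt;) is an assumption used only for this sketch.

```go
// Simplified illustration (not the operator's actual code): with a stale
// Spec.Replicas of 1, any PVC whose ordinal suffix is >= 1 looks "orphaned"
// and gets selected for deletion, even though zk2 is a live member.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// orphanPVCs returns the PVC names whose pod ordinal is beyond the
// (possibly stale) desired replica count. PVC names are assumed to end
// in "-<ordinal>", e.g. "data-zkc-1".
func orphanPVCs(pvcNames []string, staleSpecReplicas int) []string {
	var orphans []string
	for _, name := range pvcNames {
		idx := strings.LastIndex(name, "-")
		ordinal, err := strconv.Atoi(name[idx+1:])
		if err != nil {
			continue
		}
		if ordinal >= staleSpecReplicas {
			orphans = append(orphans, name)
		}
	}
	return orphans
}

func main() {
	pvcs := []string{"data-zkc-0", "data-zkc-1"} // PVCs of zk1 and zk2
	// apiserver2's watch cache is frozen at replicas=1.
	fmt.Println(orphanPVCs(pvcs, 1)) // [data-zkc-1] -- zk2's PVC is wrongly flagged
}
```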
Importance
blocker: The unexpected PVC and pod deletion caused by reading stale data from the apiserver can further lead to data loss or availability issues.
Location
zookeepercluster_controller.go
Suggestions for an improvement
We are willing to help alleviate this problem by issuing a PR.
It is hard to fully avoid this issue because the controller has no way to tell whether it is reading stale information about the ZookeeperCluster, given that the controller is supposed to be stateless. However, some sanity checks can help prevent this issue in certain cases and ensure that the controller does not perform undesired deletions before the updated information has propagated to it:
Before issuing PVC deletion in cleanupOrphanPVCs, we can first check the number of zookeeper pods and see whether it matches Spec.Replicas of the ZookeeperCluster. If not, we know at least one of them is wrong and the PVC may not be an orphan. This sanity check is helpful when the pod information is fresh (especially when the pod information is retrieved from a different apiserver/etcd than the ZookeeperCluster information).
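A minimal sketch of this first check, assuming the zookeeper pods carry an app: &lt;cluster-name&gt; label; the function name and parameters are illustrative, not the operator's actual helpers:

```go
// A minimal sketch of the suggested check, assuming the zookeeper pods carry
// an "app: <cluster-name>" label and that Spec.Replicas is an int32; names
// here are illustrative, not the operator's actual API.
package sanity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// safeToCleanupPVCs reports whether the orphan-PVC cleanup should proceed.
// If the live pod count disagrees with the (possibly stale) Spec.Replicas,
// we skip deletion for this reconcile and wait for fresher data.
func safeToCleanupPVCs(ctx context.Context, c client.Client, namespace, clusterName string, specReplicas int32) (bool, error) {
	pods := &corev1.PodList{}
	if err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": clusterName}); err != nil { // label key is an assumption
		return false, err
	}
	return int32(len(pods.Items)) == specReplicas, nil
}
```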
Before updating the statefulset in reconcileStatefulSet, we may want to check ZookeeperCluster.Spec.Replicas, ZookeeperCluster.Status.Replicas, and StatefulSet.Spec.Replicas. For example, it is a sign of abnormality if ZookeeperCluster.Spec.Replicas != StatefulSet.Spec.Replicas but ZookeeperCluster.Spec.Replicas == ZookeeperCluster.Status.Replicas. Logging a warning and returning from reconcile() with an error in such a case can be helpful.
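A sketch of this second check, again with illustrative names; reconcileStatefulSet could call something like this before pushing the statefulset update and pass the error back to reconcile():

```go
// A sketch of the second check: flag the suspicious combination where the
// statefulset disagrees with the ZookeeperCluster even though the cluster's
// own spec and status already agree -- a hint that the zkc object is stale.
package sanity

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// checkReplicasConsistency returns an error when the replica counts look
// inconsistent, so reconcile() can log a warning and retry later instead of
// scaling the statefulset down on stale data.
func checkReplicasConsistency(zkcSpecReplicas, zkcStatusReplicas int32, sts *appsv1.StatefulSet) error {
	stsReplicas := int32(1) // statefulset default when Spec.Replicas is nil
	if sts.Spec.Replicas != nil {
		stsReplicas = *sts.Spec.Replicas
	}
	if zkcSpecReplicas != stsReplicas && zkcSpecReplicas == zkcStatusReplicas {
		return fmt.Errorf("replica mismatch: zkc spec/status=%d but statefulset spec=%d; ZookeeperCluster may be stale, skipping update",
			zkcSpecReplicas, stsReplicas)
	}
	return nil
}
```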
We are willing to issue a PR to add the above checks if they are not considered too expensive.
We just found that there is a better way to avoid this problem entirely: we can label the sts with the resource version of the zkc when creating the sts. Later, before performing any scale-down, we can compare the resource version R1 of the (potentially stale) zkc with the resource version R2 recorded in the sts label. If R1 < R2, then the zkc comes from the past and we should not perform deletion based on it.
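A minimal sketch of this label-based guard, with an assumed label key and helper names; note that resource versions are opaque strings in general, so this sketch parses them as integers, as the proposal assumes:

```go
// A minimal sketch of the proposed guard (label key and helper names are
// assumptions): store the ZookeeperCluster's resourceVersion on the
// statefulset at creation time, and refuse to scale down when the zkc we are
// reconciling is older than the one recorded on the sts.
package sanity

import (
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const zkcResourceVersionLabel = "zookeeper.pravega.io/zkc-resource-version" // assumed label key

// labelWithZkcVersion records the zkc resource version on the sts when it is created.
func labelWithZkcVersion(sts *appsv1.StatefulSet, zkc metav1.Object) {
	if sts.Labels == nil {
		sts.Labels = map[string]string{}
	}
	sts.Labels[zkcResourceVersionLabel] = zkc.GetResourceVersion()
}

// zkcIsStale returns true when the zkc we just read (R1) is older than the
// version recorded on the sts (R2), i.e. R1 < R2.
func zkcIsStale(sts *appsv1.StatefulSet, zkc metav1.Object) bool {
	r1, err1 := strconv.ParseUint(zkc.GetResourceVersion(), 10, 64)
	r2, err2 := strconv.ParseUint(sts.Labels[zkcResourceVersionLabel], 10, 64)
	if err1 != nil || err2 != nil {
		return false // cannot tell; fall back to normal reconciliation
	}
	return r1 < r2
}
```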
We will try to issue a PR with the fix this week or next.