[BUG] Reading stale ZookeeperCluster spec/status can lead to undesired pod and PVC deletion #314

Closed
srteam2020 opened this issue Mar 24, 2021 · 1 comment · Fixed by #326, #355 or #359

@srteam2020
Contributor

srteam2020 commented Mar 24, 2021

Description

We find that zookeeper-operator could accidentally delete a zookeeper pod and its PVC when the controller reads stale information about the ZookeeperCluster.

More concretely, we observe that if a ZookeeperCluster goes through a scale down and then a scale up, and one of the apiservers becomes slow (or partitioned from the others), the (restarted) controller may read stale status/spec information of the ZookeeperCluster. The stale information can trigger PVC deletion in reconcileFinalizers and a statefulset scale down (zookeeper pod deletion) in reconcileStatefulSet.

We list concrete reproduction steps below:

  1. Run the controller in an HA k8s cluster and create a ZookeeperCluster zkc with replicas=2. The controller is talking to apiserver1 (which is not stale). There will be two zookeeper pods zk1 and zk2 and two PVCs in the cluster.
  2. Scale zkc down (by setting replicas=1). After the cluster stabilizes, there will be only one zookeeper pod zk1 and one PVC. Meanwhile, apiserver2 gets partitioned, so its watch cache stops at the moment when zkc has replicas=1.
  3. Scale zkc up (by setting replicas=2). A new zk2 and its PVC come back.
  4. After experiencing a crash, the restarted controller talks to the stale apiserver2. From apiserver2's watch cache, the controller finds that zkc's Status.ReadyReplicas and Spec.Replicas are both 1, which is lower than the PVC count (i.e., 2). Inside cleanupOrphanPVCs, the controller will treat the PVC of zk2 as an orphan PVC and delete it. Later, when updating the statefulset in reconcileStatefulSet, the controller will also set the Replicas of the statefulset back to 1, which will trigger the deletion of zk2 (see the sketch after this list).
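
To make the failure mode in step 4 easier to follow, here is a minimal, self-contained sketch of the decision logic as we understand it, using plain values instead of the operator's types; the function name shouldDeletePVC and its signature are ours for illustration, not the project's code:

```go
package main

import "fmt"

// shouldDeletePVC mirrors the decision described in step 4: the cleanup path treats a
// PVC as an orphan when the cluster looks "stable" (ready == desired replicas) and the
// PVC's ordinal is at or beyond the desired replica count.
func shouldDeletePVC(specReplicas, readyReplicas, pvcOrdinal int32) bool {
	if readyReplicas != specReplicas {
		// Cluster does not look fully reconciled, so cleanup is skipped.
		return false
	}
	return pvcOrdinal >= specReplicas
}

func main() {
	// Stale view from apiserver2: Spec.Replicas = Status.ReadyReplicas = 1, while the
	// real cluster still has two PVCs (ordinals 0 and 1).
	fmt.Println(shouldDeletePVC(1, 1, 0)) // false -> zk1's PVC is kept
	fmt.Println(shouldDeletePVC(1, 1, 1)) // true  -> zk2's PVC is wrongly deleted
}
```

With the stale values the stale read cannot be detected from the ZookeeperCluster object alone, which is why the sanity checks below look at other objects (pods, the StatefulSet) as well.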

Importance

blocker: The unexpected PVC and pod deletion caused by reading stale data from the apiserver can further lead to data loss or availability issues.

Location

zookeepercluster_controller.go

Suggestions for an improvement

We are willing to help alleviate this problem by issuing a PR.
It is hard to fully avoid this issue because the controller has no way to tell whether the ZookeeperCluster information it reads is stale, given that the controller is supposed to be stateless. However, some sanity checks can help prevent this issue in certain cases and ensure the controller does not perform undesired deletions before the updated information is propagated to the controller:

  1. Before issuing PVC deletion in cleanupOrphanPVCs, we can first check the number of zookeeper pods and see whether it matches Spec.Replicas of the ZookeeperCluster. If they differ, we know at least one of them is wrong and the PVC may not be an orphan. This sanity check is helpful if the pod information is fresh (especially when the pod information is retrieved from a different apiserver/etcd than the ZookeeperCluster information).
  2. Before updating the statefulset in reconcileStatefulSet, we may want to check ZookeeperCluster.Spec.Replicas, ZookeeperCluster.Status.Replicas and StatefulSet.Spec.Replicas. For example, it is a sign of abnormality if ZookeeperCluster.Spec.Replicas != StatefulSet.Spec.Replicas but ZookeeperCluster.Spec.Replicas == ZookeeperCluster.Status.Replicas. Logging a warning and returning from reconcile() with an error in such a case can be helpful. A sketch of both checks follows this list.
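
For illustration, a minimal sketch of the two checks; the helper safeToScaleDown and its signature are hypothetical, and in the operator the inputs would come from the ZookeeperCluster, the StatefulSet and a pod list:

```go
package main

import (
	"errors"
	"fmt"
)

// safeToScaleDown applies check 1 (pod count vs zkc Spec.Replicas) and check 2
// (zkc vs StatefulSet replica consistency) before any PVC deletion or STS scale down.
func safeToScaleDown(zkcSpecReplicas, zkcStatusReplicas, stsSpecReplicas, podCount int32) error {
	// Check 1: the observed pod count disagrees with the zkc spec, so one of them is
	// stale and the "extra" PVCs may not really be orphans.
	if podCount != zkcSpecReplicas {
		return fmt.Errorf("pod count %d != zkc Spec.Replicas %d; possible stale read", podCount, zkcSpecReplicas)
	}
	// Check 2: zkc disagrees with the StatefulSet while claiming to be fully reconciled,
	// which is the suspicious combination described in suggestion 2.
	if zkcSpecReplicas != stsSpecReplicas && zkcSpecReplicas == zkcStatusReplicas {
		return errors.New("zkc and StatefulSet replica counts are inconsistent; possible stale read")
	}
	return nil
}

func main() {
	// Stale zkc (Spec.Replicas = Status.Replicas = 1) vs the real StatefulSet
	// (Spec.Replicas = 2) and 2 running pods.
	if err := safeToScaleDown(1, 1, 2, 2); err != nil {
		fmt.Println("skip deletion and requeue:", err) // log a warning and return with error
	}
}
```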

We are willing to issue a PR to add the above checks if they are not considered too expensive.

@srteam2020
Contributor Author

We just found that there may be a better way to avoid this problem entirely: we can label the sts with the resource version of the zkc when creating the sts. Later, when performing any scale down, we can compare the resource version R1 of the (potentially stale) zkc with the resource version R2 labeled on the sts. If R1 < R2, then the zkc comes from history and we should not perform the deletion for it.
We will try to issue a PR this or next week for the fix.
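
A minimal sketch of this idea; the label key and helper name are ours, and it assumes resource versions can be parsed and compared as integers (they are opaque strings in the Kubernetes API, so that comparison is itself an assumption of this proposal):

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical label key recorded on the StatefulSet when it is created/updated.
const zkcResourceVersionLabel = "zookeeper-operator/zkc-resource-version"

// staleZkc reports whether the zkc we just read (R1) is older than the version
// recorded on the sts (R2), i.e. R1 < R2.
func staleZkc(zkcResourceVersion, stsLabeledVersion string) (bool, error) {
	r1, err := strconv.ParseUint(zkcResourceVersion, 10, 64)
	if err != nil {
		return false, err
	}
	r2, err := strconv.ParseUint(stsLabeledVersion, 10, 64)
	if err != nil {
		return false, err
	}
	return r1 < r2, nil
}

func main() {
	// zkc read from the stale apiserver (R1) vs the version labeled on the sts (R2).
	stale, _ := staleZkc("1001", "1042")
	fmt.Println(zkcResourceVersionLabel, "stale:", stale) // true -> skip PVC deletion and scale down
}
```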
