[BUG] Reading stale ZookeeperCluster spec/status can lead to undesired pod and PVC deletion #314

Closed
srteam2020 opened this issue Mar 24, 2021 · 1 comment · Fixed by #326, #355 or #359

@srteam2020
Contributor

srteam2020 commented Mar 24, 2021

Description

We find that zookeeper-operator could accidentally delete a zookeeper pod and its PVC when the controller reads stale information about the ZookeeperCluster.

More concretely, we observe that if a ZookeeperCluster goes through a scale down and then a scale up, and one of the apiservers becomes slow (or partitioned from the others), the (restarted) controller may read stale status/spec information of the ZookeeperCluster. The stale information can trigger PVC deletion in reconcileFinalizers and a statefulset scale down (zookeeper pod deletion) in reconcileStatefulSet.

We list concrete reproduction steps below:

  1. Run the controller in an HA k8s cluster and create a ZookeeperCluster zkc with replicas=2. The controller is talking to apiserver1 (which is not stale). There will be two zookeeper pods zk1 and zk2 and two PVCs in the cluster.
  2. Scale zkc down (by setting replicas=1). After the cluster stabilizes, there will be only one zookeeper pod zk1 and one PVC. Meanwhile, apiserver2 gets partitioned, so its watch cache stops at the moment when zkc has replicas=1.
  3. Scale zkc up (by setting replicas=2). A new zk2 and its PVC come back.
  4. After experiencing a crash, the restarted controller talks to the stale apiserver2. From apiserver2's watch cache, the controller finds that zkc's Status.ReadyReplicas and Spec.Replicas are both 1, which is lower than the PVC count (i.e., 2). Inside cleanupOrphanPVCs, the controller will treat the PVC of zk2 as an orphan PVC and delete it. Later, when updating the statefulset in reconcileStatefulSet, the controller will also set the Replicas of the statefulset back to 1, which will trigger the deletion of zk2 (see the sketch after this list).
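
To make the failure mode in step 4 easier to follow, here is a minimal, self-contained sketch of the decision logic as we understand it, using plain values instead of the operator's types; the function name shouldDeletePVC and its signature are ours for illustration, not the project's code:

```go
package main

import "fmt"

// shouldDeletePVC mirrors the decision described in step 4: the cleanup path treats a
// PVC as an orphan when the cluster looks "stable" (ready == desired replicas) and the
// PVC's ordinal is at or beyond the desired replica count.
func shouldDeletePVC(specReplicas, readyReplicas, pvcOrdinal int32) bool {
	if readyReplicas != specReplicas {
		// Cluster does not look fully reconciled, so cleanup is skipped.
		return false
	}
	return pvcOrdinal >= specReplicas
}

func main() {
	// Stale view from apiserver2: Spec.Replicas = Status.ReadyReplicas = 1, while the
	// real cluster still has two PVCs (ordinals 0 and 1).
	fmt.Println(shouldDeletePVC(1, 1, 0)) // false -> zk1's PVC is kept
	fmt.Println(shouldDeletePVC(1, 1, 1)) // true  -> zk2's PVC is wrongly deleted
}
```

With the stale values the stale read cannot be detected from the ZookeeperCluster object alone, which is why the sanity checks below look at other objects (pods, the StatefulSet) as well.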

Importance

blocker: The unexpected PVC and pod deletion caused by reading stale data from the apiserver can further lead to data loss or availability issues.

Location

zookeepercluster_controller.go

Suggestions for an improvement

We are willing to help alleviate this problem by issuing a PR.
It is hard to fully avoid this issue because the controller has no way to tell whether the ZookeeperCluster information it reads is stale, given that the controller is supposed to be stateless. However, some sanity checks can help prevent this issue in certain cases and ensure the controller does not perform undesired deletions before the updated information is propagated to the controller:

  1. Before issuing PVC deletion in cleanupOrphanPVCs, we can first check the number of zookeeper pods and see whether it matches Spec.Replicas of the ZookeeperCluster. If they differ, we know at least one of them is wrong and the PVC may not be an orphan. This sanity check is helpful if the pod information is fresh (especially when the pod information is retrieved from a different apiserver/etcd than the ZookeeperCluster information).
  2. Before updating the statefulset in reconcileStatefulSet, we may want to check ZookeeperCluster.Spec.Replicas, ZookeeperCluster.Status.Replicas and StatefulSet.Spec.Replicas. For example, it is a sign of abnormality if ZookeeperCluster.Spec.Replicas != StatefulSet.Spec.Replicas but ZookeeperCluster.Spec.Replicas == ZookeeperCluster.Status.Replicas. Logging a warning and returning from reconcile() with an error in such a case can be helpful. A sketch of both checks follows this list.
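
For illustration, a minimal sketch of the two checks; the helper safeToScaleDown and its signature are hypothetical, and in the operator the inputs would come from the ZookeeperCluster, the StatefulSet and a pod list:

```go
package main

import (
	"errors"
	"fmt"
)

// safeToScaleDown applies check 1 (pod count vs zkc Spec.Replicas) and check 2
// (zkc vs StatefulSet replica consistency) before any PVC deletion or STS scale down.
func safeToScaleDown(zkcSpecReplicas, zkcStatusReplicas, stsSpecReplicas, podCount int32) error {
	// Check 1: the observed pod count disagrees with the zkc spec, so one of them is
	// stale and the "extra" PVCs may not really be orphans.
	if podCount != zkcSpecReplicas {
		return fmt.Errorf("pod count %d != zkc Spec.Replicas %d; possible stale read", podCount, zkcSpecReplicas)
	}
	// Check 2: zkc disagrees with the StatefulSet while claiming to be fully reconciled,
	// which is the suspicious combination described in suggestion 2.
	if zkcSpecReplicas != stsSpecReplicas && zkcSpecReplicas == zkcStatusReplicas {
		return errors.New("zkc and StatefulSet replica counts are inconsistent; possible stale read")
	}
	return nil
}

func main() {
	// Stale zkc (Spec.Replicas = Status.Replicas = 1) vs the real StatefulSet
	// (Spec.Replicas = 2) and 2 running pods.
	if err := safeToScaleDown(1, 1, 2, 2); err != nil {
		fmt.Println("skip deletion and requeue:", err) // log a warning and return with error
	}
}
```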

We are willing to issue a PR to add the above checks if they are not considered too expensive.

@srteam2020
Contributor Author

We just found that there may be a better way to avoid this problem entirely: we can label the sts with the resource version of the zkc when creating the sts. Later, when performing any scale down, we can compare the resource version R1 of the (potentially stale) zkc with the resource version R2 labeled on the sts. If R1 < R2, then the zkc comes from history and we should not perform the deletion for it.
We will try to issue a PR this or next week for the fix.
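
A minimal sketch of this idea; the label key and helper name are ours, and it assumes resource versions can be parsed and compared as integers (they are opaque strings in the Kubernetes API, so that comparison is itself an assumption of this proposal):

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical label key recorded on the StatefulSet when it is created/updated.
const zkcResourceVersionLabel = "zookeeper-operator/zkc-resource-version"

// staleZkc reports whether the zkc we just read (R1) is older than the version
// recorded on the sts (R2), i.e. R1 < R2.
func staleZkc(zkcResourceVersion, stsLabeledVersion string) (bool, error) {
	r1, err := strconv.ParseUint(zkcResourceVersion, 10, 64)
	if err != nil {
		return false, err
	}
	r2, err := strconv.ParseUint(stsLabeledVersion, 10, 64)
	if err != nil {
		return false, err
	}
	return r1 < r2, nil
}

func main() {
	// zkc read from the stale apiserver (R1) vs the version labeled on the sts (R2).
	stale, _ := staleZkc("1001", "1042")
	fmt.Println(zkcResourceVersionLabel, "stale:", stale) // true -> skip PVC deletion and scale down
}
```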
