TiDB transiently unavailable when rolling update the TiKV cluster #6131

@kos-team

Description

Bug Report

What version of Kubernetes are you using?
1.31

What version of TiDB Operator are you using?
v1.6.0

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
standard

What's the status of the TiDB cluster pods?

NAME                                       READY   STATUS    RESTARTS      AGE
test-cluster-discovery-59d967d9f-nbdkf     1/1     Running   0             56m
test-cluster-pd-0                          1/1     Running   0             29m
test-cluster-pd-1                          1/1     Running   0             6m48s
test-cluster-pd-2                          1/1     Running   0             6m48s
test-cluster-ticdc-0                       1/1     Running   0             22m
test-cluster-ticdc-1                       1/1     Running   0             22m
test-cluster-ticdc-2                       1/1     Running   0             22m
test-cluster-tidb-0                        2/2     Running   0             22m
test-cluster-tidb-1                        2/2     Running   0             23m
test-cluster-tidb-2                        2/2     Running   0             24m
test-cluster-tiflash-0                     4/4     Running   0             26m
test-cluster-tiflash-1                     4/4     Running   0             27m
test-cluster-tiflash-2                     4/4     Running   0             28m
test-cluster-tikv-0                        1/1     Running   0             13m
test-cluster-tikv-1                        1/1     Running   0             8m33s
test-cluster-tikv-2                        1/1     Running   0             9m51s
tidb-controller-manager-59c5d6499f-55qwl   1/1     Running   0             57m

What did you do?

  1. Install the cluster by applying the following CR:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n  tcp-keep-alive = true\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```

  2. Upgrade the TiKV cluster in a way that triggers a rolling update of the TiKV StatefulSet, for example by changing `enableDynamicConfiguration` from `true` to `false`.
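As a concrete way to trigger step 2, the spec change can be applied with a patch like the following (a sketch; the cluster name matches the CR above, and the `default` namespace and the `app.kubernetes.io/component` label are assumptions):

```shell
# Flip enableDynamicConfiguration on the TidbCluster CR; the operator then
# performs a rolling update of the TiKV StatefulSet.
kubectl patch tidbcluster test-cluster -n default \
  --type merge \
  -p '{"spec":{"enableDynamicConfiguration":false}}'

# Watch the TiKV pods restart one by one during the rolling update.
kubectl get pods -n default -l app.kubernetes.io/component=tikv -w
```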

What did you expect to see?
The cluster remains highly available during upgrade operations.

We tried performing the upgrade manually, following the procedure of leader eviction, pod restart, and removal of the leader-eviction scheduler, and were able to maintain 100% availability.
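The manual procedure above can be sketched per TiKV store roughly as follows (assumptions: `pd-ctl` can reach the PD endpoint, `jq` is available, and `STORE_ID`/`POD` are illustrative placeholders for one store and its pod):

```shell
STORE_ID=1
POD=test-cluster-tikv-0

# 1. Add an evict-leader scheduler so PD moves Raft leaders off this store.
pd-ctl scheduler add evict-leader-scheduler "$STORE_ID"

# 2. Wait until the store holds no leaders before touching the pod.
while [ "$(pd-ctl store "$STORE_ID" | jq '.status.leader_count')" -gt 0 ]; do
  sleep 5
done

# 3. Restart the pod (the StatefulSet recreates it).
kubectl delete pod "$POD"
kubectl wait --for=condition=Ready "pod/$POD" --timeout=10m

# 4. Remove the eviction scheduler so leaders can move back.
pd-ctl scheduler remove "evict-leader-scheduler-$STORE_ID"
```

The key difference from the operator's behavior is step 2: the restart is gated on `leader_count` actually reaching zero.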

What did you see instead?
The cluster loses availability for one minute during the operation.
The root cause is improper leader eviction. The operator does try to evict the leaders from a TiKV pod before restarting it; however, when restarting the last TiKV pod, it does not wait for the eviction to fully complete.
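One way to observe the premature restart (a sketch, assuming `pd-ctl` access to the PD endpoint and `jq` installed) is to watch per-store leader counts while the operator performs the rolling update; the last TiKV pod is restarted while its `leader_count` is still non-zero:

```shell
# Print "store-id  leader_count" for every store every few seconds.
while true; do
  pd-ctl store \
    | jq -r '.stores[] | "\(.store.id)\t\(.status.leader_count)"'
  sleep 3
done
```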
