TiDB transiently unavailable when rolling update the TiKV cluster #6131

@kos-team

Description

Bug Report

What version of Kubernetes are you using?
1.31

What version of TiDB Operator are you using?
v1.6.0

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
standard

What's the status of the TiDB cluster pods?

NAME                                       READY   STATUS    RESTARTS      AGE
test-cluster-discovery-59d967d9f-nbdkf     1/1     Running   0             56m
test-cluster-pd-0                          1/1     Running   0             29m
test-cluster-pd-1                          1/1     Running   0             6m48s
test-cluster-pd-2                          1/1     Running   0             6m48s
test-cluster-ticdc-0                       1/1     Running   0             22m
test-cluster-ticdc-1                       1/1     Running   0             22m
test-cluster-ticdc-2                       1/1     Running   0             22m
test-cluster-tidb-0                        2/2     Running   0             22m
test-cluster-tidb-1                        2/2     Running   0             23m
test-cluster-tidb-2                        2/2     Running   0             24m
test-cluster-tiflash-0                     4/4     Running   0             26m
test-cluster-tiflash-1                     4/4     Running   0             27m
test-cluster-tiflash-2                     4/4     Running   0             28m
test-cluster-tikv-0                        1/1     Running   0             13m
test-cluster-tikv-1                        1/1     Running   0             8m33s
test-cluster-tikv-2                        1/1     Running   0             9m51s
tidb-controller-manager-59c5d6499f-55qwl   1/1     Running   0             57m

What did you do?

  1. Install the cluster by applying the following CR:

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n  internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n  tcp-keep-alive = true\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
    - resources:
        requests:
          storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: 'log-level = "info"

      '
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
```

  2. Upgrade the TiKV cluster in a way that triggers a rolling update of the TiKV StatefulSet, for example by changing `enableDynamicConfiguration` from `true` to `false`.
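As a concrete way to trigger step 2, the spec change can be applied with a patch like the following (a sketch; the cluster name matches the CR above, and the `default` namespace and the `app.kubernetes.io/component` label are assumptions):

```shell
# Flip enableDynamicConfiguration on the TidbCluster CR; the operator then
# performs a rolling update of the TiKV StatefulSet.
kubectl patch tidbcluster test-cluster -n default \
  --type merge \
  -p '{"spec":{"enableDynamicConfiguration":false}}'

# Watch the TiKV pods restart one by one during the rolling update.
kubectl get pods -n default -l app.kubernetes.io/component=tikv -w
```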

What did you expect to see?
The cluster remains highly available during upgrade operations.

We tried performing the upgrade manually, following the procedure of leader eviction, pod restart, and removal of the leader-eviction scheduler, and were able to maintain 100% availability.
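The manual procedure above can be sketched per TiKV store roughly as follows (assumptions: `pd-ctl` can reach the PD endpoint, `jq` is available, and `STORE_ID`/`POD` are illustrative placeholders for one store and its pod):

```shell
STORE_ID=1
POD=test-cluster-tikv-0

# 1. Add an evict-leader scheduler so PD moves Raft leaders off this store.
pd-ctl scheduler add evict-leader-scheduler "$STORE_ID"

# 2. Wait until the store holds no leaders before touching the pod.
while [ "$(pd-ctl store "$STORE_ID" | jq '.status.leader_count')" -gt 0 ]; do
  sleep 5
done

# 3. Restart the pod (the StatefulSet recreates it).
kubectl delete pod "$POD"
kubectl wait --for=condition=Ready "pod/$POD" --timeout=10m

# 4. Remove the eviction scheduler so leaders can move back.
pd-ctl scheduler remove "evict-leader-scheduler-$STORE_ID"
```

The key difference from the operator's behavior is step 2: the restart is gated on `leader_count` actually reaching zero.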

What did you see instead?
The cluster loses availability for one minute during the operation.
The root cause is improper leader eviction. The operator does try to evict the leaders from a TiKV pod before restarting it; however, when restarting the last TiKV pod, it does not wait for the eviction to fully complete.
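One way to observe the premature restart (a sketch, assuming `pd-ctl` access to the PD endpoint and `jq` installed) is to watch per-store leader counts while the operator performs the rolling update; the last TiKV pod is restarted while its `leader_count` is still non-zero:

```shell
# Print "store-id  leader_count" for every store every few seconds.
while true; do
  pd-ctl store \
    | jq -r '.stores[] | "\(.store.id)\t\(.status.leader_count)"'
  sleep 3
done
```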
