Description
Bug Report
What version of Kubernetes are you using?
1.31
What version of TiDB Operator are you using?
v1.6.0
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
standard
What's the status of the TiDB cluster pods?
NAME                                       READY   STATUS    RESTARTS   AGE
test-cluster-discovery-59d967d9f-nbdkf     1/1     Running   0          56m
test-cluster-pd-0                          1/1     Running   0          29m
test-cluster-pd-1                          1/1     Running   0          6m48s
test-cluster-pd-2                          1/1     Running   0          6m48s
test-cluster-ticdc-0                       1/1     Running   0          22m
test-cluster-ticdc-1                       1/1     Running   0          22m
test-cluster-ticdc-2                       1/1     Running   0          22m
test-cluster-tidb-0                        2/2     Running   0          22m
test-cluster-tidb-1                        2/2     Running   0          23m
test-cluster-tidb-2                        2/2     Running   0          24m
test-cluster-tiflash-0                     4/4     Running   0          26m
test-cluster-tiflash-1                     4/4     Running   0          27m
test-cluster-tiflash-2                     4/4     Running   0          28m
test-cluster-tikv-0                        1/1     Running   0          13m
test-cluster-tikv-1                        1/1     Running   0          8m33s
test-cluster-tikv-2                        1/1     Running   0          9m51s
tidb-controller-manager-59c5d6499f-55qwl   1/1     Running   0          57m
What did you do?
- Install the cluster by applying the following CR:
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test-cluster
spec:
  configUpdateStrategy: RollingUpdate
  enableDynamicConfiguration: true
  helper:
    image: alpine:3.16.0
  pd:
    baseImage: pingcap/pd
    config: "[dashboard]\n internal-proxy = true\n"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 10Gi
  pvReclaimPolicy: Retain
  ticdc:
    baseImage: pingcap/ticdc
    replicas: 3
  tidb:
    baseImage: pingcap/tidb
    config: "[performance]\n tcp-keep-alive = true\n"
    maxFailoverCount: 0
    replicas: 3
    service:
      externalTrafficPolicy: Local
      type: NodePort
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 3
    storageClaims:
      - resources:
          requests:
            storage: 10Gi
  tikv:
    baseImage: pingcap/tikv
    config: |
      log-level = "info"
    maxFailoverCount: 0
    mountClusterClientSecret: true
    replicas: 3
    requests:
      storage: 100Gi
  timezone: UTC
  version: v8.1.0
- Trigger a rolling upgrade of the TiKV StatefulSet by changing the cluster spec, for example flipping enableDynamicConfiguration from true to false (see the patch sketch after this list).
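One way to trigger this step, assuming the CR above was applied as test-cluster in a namespace called tidb (the namespace is illustrative), is a merge patch such as:

# Flip enableDynamicConfiguration from true to false; this rolls the TiKV StatefulSet.
kubectl -n tidb patch tidbcluster test-cluster --type merge \
  -p '{"spec":{"enableDynamicConfiguration":false}}'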
What did you expect to see?
The cluster remains highly available during upgrade operations.
We also performed the upgrade manually, following the procedure of evicting leaders, restarting the pod, and then removing the evict-leader scheduler (sketched below), and were able to maintain 100% availability.
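For context, the manual procedure was roughly the following sketch; the namespace, store IDs, pod names, and the pd-ctl path inside the PD pod are illustrative assumptions, not exact commands from our environment:

# 1. Add an evict-leader scheduler for the store backing the TiKV pod to be restarted.
kubectl -n tidb exec test-cluster-pd-0 -- /pd-ctl scheduler add evict-leader-scheduler 4

# 2. Wait until the store's leader_count has dropped to 0 before touching the pod.
kubectl -n tidb exec test-cluster-pd-0 -- /pd-ctl store 4 | grep leader_count

# 3. Restart the pod; the StatefulSet recreates it.
kubectl -n tidb delete pod test-cluster-tikv-1

# 4. Once the pod is healthy again, remove the scheduler so leaders can transfer back.
kubectl -n tidb exec test-cluster-pd-0 -- /pd-ctl scheduler remove evict-leader-scheduler-4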
What did you see instead?
The cluster loses availability for one minute during the operation.
The root cause is improper leader eviction. The operator does evict region leaders from each TiKV pod before restarting it; however, when restarting the last TiKV pod, it does not wait for the eviction to fully complete.
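This can be observed with a sketch like the following (namespace, store ID, and pd-ctl path are illustrative assumptions): watch the leader count of the store backing the last TiKV pod while the operator performs the rolling restart. If the pod is deleted while leader_count is still non-zero, the eviction had not completed.

# Poll the leader count of the store backing the last TiKV pod during the upgrade.
while true; do
  kubectl -n tidb exec test-cluster-pd-0 -- /pd-ctl store 1 | grep leader_count
  sleep 1
done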