Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774

Merged
merged 4 commits into from Sep 28, 2018

Conversation

@winkyao
Copy link
Member

commented Sep 25, 2018

Fix issue that when the pd leader is down, the etcd session will be stale, and tidb may costs several minutes to create a new session to the etcd in some case.

I have done a manual test using ansible, with inventory.ini:

## TiDB Cluster Part
[tidb_servers]
192.168.59.105

[tikv_servers]
192.168.59.105

[pd_servers]
192.168.59.105
192.168.59.103

[spark_master]

[spark_slaves]

## Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
192.168.59.105

[grafana_servers]
192.168.59.105

# node_exporter and blackbox_exporter servers
[monitored_servers]
192.168.59.105

[alertmanager_servers]
192.168.59.105

[kafka_exporter_servers]

## Binlog Part
[pump_servers:children]
tidb_servers

[drainer_servers]

## Group variables
[pd_servers:vars]
# location_labels = ["zone","rack","host"]

## Global variables
[all:vars]
deploy_dir = /home/tidb/deploy

## Connection
# ssh via normal user
ansible_user = wink

cluster_name = test-cluster

tidb_version = v2.0.1

# process supervision, [systemd, supervise]
process_supervision = systemd

# timezone of deployment region
timezone = Asia/Shanghai
set_timezone = True

enable_firewalld = False
# check NTP service
enable_ntpd = True
set_hostname = False
# CPU, memory and disk performance will not be checked when dev_mode = True
dev_mode = False

## binlog trigger
enable_binlog = False
# zookeeper address of kafka cluster for binlog, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = ""
# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = ""

# store slow query log into seperate file
enable_slow_query_log = False

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# Optional: Set if you already have a alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "admin"


### Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

There are 2 pd, 1 tidb, 1 tikv in the test cluster, I start a client which continue inserting data into tidb, and I kill -9 the pd leader, we can see the tidb report errors:

2018/09/28 11:53:58.800 domain.go:417: [info] [ddl] reload schema in loop, server info syncer need restart

After the pd recover: 2018/09/28 11:54:59.562 leader.go:269: [info] PD cluster leader pd1 is ready to serve, the etcd session is recovered:

2018/09/28 11:54:59.545 domain.go:419: [info] [ddl] server info syncer restarted.

It costs several millseconds to recover.

@winkyao winkyao added the status/DNM label Sep 25, 2018
@@ -88,7 +90,7 @@ type SchemaSyncer interface {
type schemaVersionSyncer struct {
selfSchemaVerPath string
etcdCli *clientv3.Client
session *concurrency.Session
session unsafe.Pointer

This comment has been minimized.

Copy link
@crazycs520

crazycs520 Sep 26, 2018

Contributor

Why change this?

This comment has been minimized.

Copy link
@winkyao

winkyao Sep 26, 2018

Author Member

session is accessed in multi-thread.

This comment has been minimized.

Copy link
@zimulala

zimulala Sep 28, 2018

Member

Can you talk about it in detail?

This comment has been minimized.

Copy link
@winkyao

winkyao Sep 28, 2018

Author Member

loadSnapshotInfoSchemaIfNeeded will use the session, and loadSchemaLoop will too.

This comment has been minimized.

Copy link
@zimulala

zimulala Sep 28, 2018

Member

Gotcha.

@winkyao winkyao force-pushed the winkyao:fast_failover_etcd branch from 00d4509 to 58fcff5 Sep 26, 2018
@winkyao winkyao removed the status/DNM label Sep 28, 2018
@winkyao

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2018

Copy link
Contributor

left a comment

LGTM

Copy link
Contributor

left a comment

LGTM

@ciscoxll

This comment has been minimized.

Copy link
Contributor

commented Sep 28, 2018

@zimulala PTAL.

@winkyao winkyao added the status/LGT2 label Sep 28, 2018
err := do.mustRestartSyncer()
if err != nil {
log.Errorf("[ddl] reload schema in loop, schema syncer restart err %v", errors.ErrorStack(err))
break
}
do.SchemaValidator.Restart()

This comment has been minimized.

Copy link
@zimulala

zimulala Sep 28, 2018

Member

Need to add metrics here?

This comment has been minimized.

Copy link
@winkyao

winkyao Sep 28, 2018

Author Member

we already have metrics.NewSessionHistogram and metrics.DeploySyncerHistogram, I think there is no need to add here.

This comment has been minimized.

Copy link
@zimulala

zimulala Sep 28, 2018

Member

Got it.

@winkyao

This comment has been minimized.

Copy link
Member Author

commented Sep 28, 2018

/run-all-tests

Copy link
Member

left a comment

LGTM

@@ -88,7 +90,7 @@ type SchemaSyncer interface {
type schemaVersionSyncer struct {
selfSchemaVerPath string
etcdCli *clientv3.Client
session *concurrency.Session
session unsafe.Pointer

This comment has been minimized.

Copy link
@zimulala

zimulala Sep 28, 2018

Member

Gotcha.

@winkyao winkyao merged commit 6a1e94f into pingcap:master Sep 28, 2018
11 checks passed
11 checks passed
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-ci-tidb/build Jenkins job succeeded.
Details
jenkins-ci-tidb/common-test Jenkins job succeeded.
Details
jenkins-ci-tidb/integration-common-test Jenkins job succeeded.
Details
jenkins-ci-tidb/integration-compatibility-test Jenkins job succeeded.
Details
jenkins-ci-tidb/integration-ddl-test Jenkins job succeeded.
Details
jenkins-ci-tidb/mybatis-test Jenkins job succeeded.
Details
jenkins-ci-tidb/sqllogic-test Jenkins job succeeded.
Details
jenkins-ci-tidb/unit-test Jenkins job succeeded.
Details
license/cla Contributor License Agreement is signed.
Details
@winkyao winkyao deleted the winkyao:fast_failover_etcd branch Sep 28, 2018
winkyao added a commit to winkyao/tidb that referenced this pull request Sep 28, 2018
winkyao added a commit that referenced this pull request Oct 8, 2018
…maVersionSyncer #7774 (#7810)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.