Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After importing data using lightning, TiDB fails to connect PD #11767

Closed
wonderful1984 opened this issue Aug 19, 2019 · 21 comments
Closed

After importing data using lightning, TiDB fails to connect PD #11767

wonderful1984 opened this issue Aug 19, 2019 · 21 comments
Assignees
Labels
type/bug This issue is a bug.

Comments

@wonderful1984
Copy link

wonderful1984 commented Aug 19, 2019

1、使用lightning导入数据后,tidb无法链接pd

[2019/08/19 09:32:58.465 +08:00] [INFO] [region_cache.go:827] ["region epoch is ahead of tikv"] [error="region epoch is ahead of tikv. rpc ctx: region ID: 1332486, meta: id:1332486 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\350_r\\200\\000\\000\\000\\010;\\211=\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\350_r\\200\\000\\000\\000\\010G\\361o\" region_epoch: peers: peers: peers: , peer: id:1332489 store_id:801041 , addr: 10.218.110.59:20160, idx: 2, currentRegions: [id:1332486 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377;\\211=\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377A^\\231\\000\\000\\000\\000\\000\\372\" region_epoch: peers: peers: peers:  id:1327216 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\3774\\351\\357\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377;\\211=\\000\\000\\000\\000\\000\\372\" region_epoch: peers: peers: peers: ]"] [errorVerbose="region epoch is ahead of tikv. rpc ctx: region ID: 1332486, meta: id:1332486 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\350_r\\200\\000\\000\\000\\010;\\211=\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\350_r\\200\\000\\000\\000\\010G\\361o\" region_epoch: peers: peers: peers: , peer: id:1332489 store_id:801041 , addr: 10.218.110.59:20160, idx: 2, currentRegions: [id:1332486 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377;\\211=\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377A^\\231\\000\\000\\000\\000\\000\\372\" region_epoch: peers: peers: peers:  id:1327216 start_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\3774\\351\\357\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\n\\377\\350_r\\200\\000\\000\\000\\010\\377;\\211=\\000\\000\\000\\000\\000\\372\" region_epoch: peers: peers: peers: ]\ngithub.com/pingcap/tidb/store/tikv.(*RegionCache).OnRegionEpochNotMatch\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:826\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).onRegionError\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:252\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:128\ngithub.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:72\ngithub.com/pingcap/tidb/store/tikv.(*tikvStore).SendReq\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/kv.go:374\ngithub.com/pingcap/tidb/store/tikv/gcworker.(*GCWorker).resolveLocksForRange\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/gcworker/gc_worker.go:833\ngithub.com/pingcap/tidb/store/tikv/gcworker.(*GCWorker).resolveLocks.func1\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/gcworker/gc_worker.go:786\ngithub.com/pingcap/tidb/store/tikv.(*rangeTaskWorker).run\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/range_task.go:279\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"]

["Release Version"=v4.0.0-alpha-74-gdb6c36c75] ["Git Commit Hash"=db6c36c75f64d2f6d92f6ef5b810b96ce6a42f09] ["Git Branch"=master]

2、重启tidb,报错无法启动

[2019/08/19 09:30:58.085 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2019/08/19 09:30:58.087 +08:00] [INFO] [client.go:256] ["[pd] switch leader"] [new-leader=http://10.218.60.210:2379] [old-leader=]
[2019/08/19 09:30:58.087 +08:00] [INFO] [client.go:167] ["[pd] init cluster id"] [cluster-id=6711871066048424412]
[2019/08/19 09:30:58.107 +08:00] [INFO] [store.go:74] ["new store with retry success"]
[2019/08/19 09:31:54.939 +08:00] [WARN] [backoff.go:326] ["pdRPC backoffer.maxSleep 20000ms is exceeded, errors:\nloadRegion from PD failed, key: \"mBootstra\\xffpKey\\x00\\x00\\x00\\x00\\xfb\\x00\\x00\\x00\\x00\\x00\\x00\\x00s\", err: rpc error: code = DeadlineExceeded desc = context deadline exceeded at 2019-08-19T09:31:45.371274478+08:00\nloadRegion from PD failed, key: \"mBootstra\\xffpKey\\x00\\x00\\x00\\x00\\xfb\\x00\\x00\\x00\\x00\\x00\\x00\\x00s\", err: rpc error: code = DeadlineExceeded desc = context deadline exceeded at 2019-08-19T09:31:49.875650747+08:00\nloadRegion from PD failed, key: \"mBootstra\\xffpKey\\x00\\x00\\x00\\x00\\xfb\\x00\\x00\\x00\\x00\\x00\\x00\\x00s\", err: rpc error: code = DeadlineExceeded desc = context deadline exceeded at 2019-08-19T09:31:54.939015454+08:00"]
[2019/08/19 09:31:54.939 +08:00] [FATAL] [session.go:1672] ["check bootstrapped failed"] [error="[tikv:9001]PD server timeout"] [errorVerbose="[tikv:9001]PD server timeout\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/workspace/build_tidb_master/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/workspace/build_tidb_master/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb/store/tikv.(*RegionCache).loadRegion\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:626\ngithub.com/pingcap/tidb/store/tikv.(*RegionCache).findRegionByKey\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:358\ngithub.com/pingcap/tidb/store/tikv.(*RegionCache).LocateKey\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:318\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:244\ngithub.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:222\ngithub.com/pingcap/tidb/kv.(*unionStore).Get\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/kv/union_store.go:194\ngithub.com/pingcap/tidb/store/tikv.(*tikvTxn).Get\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/store/tikv/txn.go:149\ngithub.com/pingcap/tidb/structure.(*TxStructure).Get\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/structure/string.go:36\ngithub.com/pingcap/tidb/structure.(*TxStructure).GetInt64\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/structure/string.go:45\ngithub.com/pingcap/tidb/meta.(*Meta).GetBootstrapVersion\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/meta/meta.go:708\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion.func1\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1667\ngithub.com/pingcap/tidb/kv.RunInNewTxn\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/kv/txn.go:47\ngithub.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1664\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1503\nmain.createStoreAndDomain\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/tidb-server/main.go:234\nmain.main\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/tidb-server/main.go:176\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1672\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1503\nmain.createStoreAndDomain\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/tidb-server/main.go:234\nmain.main\n\t/home/jenkins/workspace/build_tidb_master/go/src/github.com/pingcap/tidb/tidb-server/main.go:176\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
@wonderful1984 wonderful1984 added the type/bug This issue is a bug. label Aug 19, 2019
@wonderful1984 wonderful1984 changed the title 使用lightning导入数据到过程中,tidb无法链接pd 使用lightning导入数据后,tidb无法链接pd Aug 19, 2019
@wonderful1984
Copy link
Author

wonderful1984 commented Aug 19, 2019

kv log

[2019/08/19 09:42:49.201 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2019/08/19 09:42:49.930 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=730.436273ms]
[2019/08/19 09:42:51.930 +08:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]
[2019/08/19 09:42:51.930 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] 
[2019/08/19 09:42:51.931 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2ba17150ef80 for subchannel 0x2b7646aa3a00"]
[2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] 
[2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] 
[2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2019/08/19 09:42:52.656 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=725.599177ms]
[2019/08/19 09:42:54.656 +08:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]
[2019/08/19 09:42:54.656 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] 
[2019/08/19 09:42:54.657 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2c2ae8b541c0 for subchannel 0x2b7646ab8400"]
[2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] 
[2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] 
[2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2019/08/19 09:42:55.375 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=718.196067ms]
[2019/08/19 09:42:56.957 +08:00] [INFO] [store.rs:1223] ["merged peer receives a stale message"] [msg_type=MsgRequestPreVote] [current_region_epoch="RegionEpoch { conf_ver: 15, version: 5528, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }"] [region_id=1061196]
[2019/08/19 09:42:56.957 +08:00] [INFO] [store.rs:312] ["raft message is stale, tell to gc"] [msg_type=MsgRequestPreVote] [current_region_epoch="RegionEpoch { conf_ver: 15, version: 5528, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }"] [region_id=1061196]

@tiancaiamao
Copy link
Contributor

"[tikv:9001]PD server timeout"

Are you sure the PD server is running? @wonderful1984

@tiancaiamao tiancaiamao changed the title 使用lightning导入数据后,tidb无法链接pd After importing data using lightning, TiDB fails to connect PD Aug 19, 2019
@wonderful1984
Copy link
Author

PD 1 leader
[2019/08/17 21:14:22.517 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=625.9951ms] [prev-physical=2019/08/17 21:14:21.891 +08:00] [now=2019/08/17 21:14:22.517 +08:00]
[2019/08/17 21:14:45.244 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=9792] [operator="\"merge-region (kind:leader,region,merge, region:9792(96,8), createAt:2019-08-17 21:04:44.854500066 +0800 CST m=+168884.260018674, startAt:2019-08-17 21:04:44.85464217 +0800 CST m=+168884.260160779, currentStep:4, steps:[transfer leader from store 6 to store 1 add learner peer 7060715 on store 801042 promote learner peer 7060715 on store 801042 to voter remove peer on store 6 add learner peer 7060716 on store 801041 promote learner peer 7060716 on store 801041 to voter remove peer on store 4 merge region 9792 into region 9645]) timeout\""]
[2019/08/17 21:14:45.244 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=9645] [operator="\"merge-region (kind:leader,region,merge, region:9645(88,14), createAt:2019-08-17 21:04:44.85450166 +0800 CST m=+168884.260020269, startAt:2019-08-17 21:04:44.854726147 +0800 CST m=+168884.260244871, currentStep:0, steps:[merge region 9792 into region 9645]) timeout\""]
[2019/08/17 21:15:32.744 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=302681] [operator="\"merge-region (kind:region,merge, region:302681(1648,5), createAt:2019-08-17 21:05:28.791592918 +0800 CST m=+168928.197111524, startAt:2019-08-17 21:05:28.791802259 +0800 CST m=+168928.197320867, currentStep:0, steps:[merge region 302505 into region 302681]) timeout\""]
[2019/08/17 21:15:32.745 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=302505] [operator="\"merge-region (kind:region,merge, region:302505(1645,11), createAt:2019-08-17 21:05:28.791591469 +0800 CST m=+168928.197110077, startAt:2019-08-17 21:05:28.791717791 +0800 CST m=+168928.197236399, currentStep:3, steps:[add learner peer 7060839 on store 4 promote learner peer 7060839 on store 4 to voter remove peer on store 801041 add learner peer 7060840 on store 6 promote learner peer 7060840 on store 6 to voter remove peer on store 801042 merge region 302505 into region 302681]) timeout\""]
[2019/08/17 22:29:22.757 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=815.313976ms] [prev-physical=2019/08/17 22:29:21.941 +08:00] [now=2019/08/17 22:29:22.757 +08:00]
[2019/08/17 22:34:22.550 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=608.640992ms] [prev-physical=2019/08/17 22:34:21.941 +08:00] [now=2019/08/17 22:34:22.550 +08:00]
[2019/08/17 22:39:22.483 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=541.727388ms] [prev-physical=2019/08/17 22:39:21.941 +08:00] [now=2019/08/17 22:39:22.483 +08:00]
[2019/08/17 23:49:08.189 +08:00] [WARN] [raft.go:358] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=603.176518ms]
[2019/08/19 00:09:52.834 +08:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp pd_leader:2379->lightning:41386: read: connection reset by peer"]

PD 2 follower
[2019/08/17 20:10:20.281 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2348794]
[2019/08/17 20:10:20.614 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=309.202189ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 20:10:25.079 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2348794] [took=4.488464414s]
[2019/08/17 21:10:20.583 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2350038]
[2019/08/17 21:10:20.978 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=354.571268ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 21:10:25.426 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2350038] [took=4.488168608s]
[2019/08/17 22:10:20.931 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2351316]
[2019/08/17 22:10:21.229 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=298.371665ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 22:10:25.686 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2351316] [took=4.457144061s]
[2019/08/17 23:10:21.205 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2352518]
[2019/08/17 23:10:21.492 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=286.663175ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []


PD 3 follower
[2019/08/17 19:10:24.748 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2347584] [took=4.485168367s]
[2019/08/17 20:10:20.279 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2348794]
[2019/08/17 20:10:20.583 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=303.897678ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 20:10:25.083 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2348794] [took=4.500436506s]
[2019/08/17 21:10:20.582 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2350038]
[2019/08/17 21:10:20.897 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=315.137035ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 21:10:25.402 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2350038] [took=4.504897045s]
[2019/08/17 22:10:20.929 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2351316]
[2019/08/17 22:10:21.225 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=295.918878ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 22:10:25.683 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2351316] [took=4.458739444s]
[2019/08/17 23:10:21.204 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2352518]
[2019/08/17 23:10:21.477 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.698022ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/17 23:10:25.939 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2352518] [took=4.461341314s]

@wonderful1984
Copy link
Author

wonderful1984 commented Aug 19, 2019

事件:导入历史数据
# 过程如下
2019/08/17 09:56:33.317 lightning导入A库 157G
2019/08/17 12:05:59.939 导完,耗时2小时 [info] restore all tables data takes 2h8m33.534074942s

2019/08/17 19:13:05.511 lightning导入B库 5.2T
2019/08/17 19:13:05.956 导入失败 [error] B库.B_table-schema.sql as auto: invalid schema encoding
2019/08/17 19:13:14.921  [error] [step 2] run cause error : table info B_table not found
# B库编码问题,先skip,后续再处理

2019/08/17 20:28:35.769 lightning导入C库 288G
# 处理了 30分钟后
2019/08/17 21:06:11.158  [info] [`C库`.`C_table`:1] flush kv deliver ...
2019/08/17 21:06:11.158  [info] [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import
2019/08/17 21:06:45.276  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 129.37 MiB/s, post-processing
2019/08/17 21:09:05.420  [warning] cannot switch to Import mode: rpc error: code = Unknown desc = PdRPC(Other(StringError("[src/pd/util.rs:344]: fail to request")))
2019/08/17 21:09:05.432  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 121.87 MiB/s, post-processing
2019/08/17 21:09:45.276  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 119.90 MiB/s, post-processing
2019/08/17 21:10:45.276  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 117.04 MiB/s, post-processing
2019/08/17 21:11:45.276  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 114.32 MiB/s, post-processing
2019/08/17 21:12:45.276  [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 111.72 MiB/s, post-processing
2019/08/17 21:14:05.308  [warning] cannot switch to Import mode: rpc error: code = Unknown desc = PdRPC(Other(StringError("[src/pd/util.rs:344]: fail to request")))
# 一直卡了29个小时
2019/08/19 01:30:46.760  [error] [`C库`.`C_table`] error [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed("retry 4 times still 312 ranges failed")
2019/08/19 01:30:46.760  [info] restore all tables data takes 29h2m1.484031341s
2019/08/19 01:30:46.760  [error] [step 2] run cause error : [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed("retry 4 times still 312 ranges failed")

@tiancaiamao
Copy link
Contributor

The PD leader seems not to function well.
PTAL @disksing

@disksing
Copy link
Contributor

Seems tikv-server cannot connect to pd-server, but pd-server's log looks fine. Could you try to access PD's api from tikv-server host to check if there are some problem with network? For example, you can curl http://pd-addr:2379/pd/api/v1/stores.

@wonderful1984
Copy link
Author

wonderful1984 commented Aug 23, 2019

no response

[tidb ~]$ timeout 30 curl http://pd_leader:2379/pd/api/v1/stores    
[tidb ~]$ 

@wonderful1984
Copy link
Author

strace

futex(0xc00026f648, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99680522}) = -1 ETIMEDOUT (Connection timed out)
futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc000526f48, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99668602}) = -1 ETIMEDOUT (Connection timed out)
futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00037d9c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99646967}) = -1 ETIMEDOUT (Connection timed out)
futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0001a0bc8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99634101}) = -1 ETIMEDOUT (Connection timed out)
futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0001a0bc8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 79819665}) = -1 ETIMEDOUT (Connection timed out)
futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0000aa4c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, {}, 128, 0, NULL)        = 0
nanosleep({0, 3000}, NULL)              = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, {}, 128, 0, NULL)        = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, {}, 128, 0, NULL)        = 0
futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1
write(45, "\0\0\4\10\0\0\0\0\0\0\0\0\5\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7", 30) = 30
nanosleep({0, 3000}, NULL)              = 0
futex(0xc0263fc148, FUTEX_WAKE_PRIVATE, 1) = 1
write(25, "\0\0\4\10\0\0\0\0\0\0\0\0\5\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7\0\0"..., 343) = 343
epoll_pwait(4, {}, 128, 0, NULL)        = 0
write(24, "\0\0\10\6\1\0\0\0\0\2\4\20\20\t\16\7\7\0\0\4\10\0\0\0\0\0\0\0\1\32\0\0"..., 47) = 47
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, {}, 128, 0, NULL)        = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc0263fc148, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=3301687960, u64=47960906515096}}}, 128, 0, NULL) = 1
epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=3301687960, u64=47960906515096}}}, 128, -1, NULL) = 1
epoll_pwait(4, {{EPOLLIN, {u32=1136229736, u64=47950151122280}}}, 128, -1, NULL) = 1
accept4(7, {sa_family=AF_INET, sin_port=htons(64970), sin_addr=inet_addr("tidb-lightning")}, [16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 54
epoll_ctl(4, EPOLL_CTL_ADD, 54, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3301688792, u64=47960906515928}}) = 0
getsockname(54, {sa_family=AF_INET, sin_port=htons(2379), sin_addr=inet_addr("pd_leader")}, [16]) = 0
setsockopt(54, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(54, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(54, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0
setsockopt(54, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0
accept4(7, 0xc00a6217c8, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=1136226824, u64=47950151119368}}}, 128, 0, NULL) = 1
read(25, "\0\0\4\10\0\0\0\0\0\0\0\3N\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7", 32768) = 30
read(25, 0xc00d50a000, 32768)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(4, {}, 128, 0, NULL)        = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL^CProcess 64298 detached
 

@wonderful1984
Copy link
Author

using tidb-lightning(v2.1.14 or v2.1.16),PD Always hang

leader

[2019/08/22 15:09:26.955 +08:00] [INFO] [id.go:110] ["idAllocator allocates a new id"] [alloc-id=7456000]
[2019/08/22 15:09:37.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454821] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454821(6422,14), createAt:2019-08-22 15:09:26.485487154 +0800 CST m=+10585.313715320, startAt:2019-08-22 15:09:26.485538587 +0800 CST m=+10585.313766798, currentStep:0, steps:[transfer leader from s
tore 6 to store 1]) timeout\""]
[2019/08/22 15:09:37.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454847] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454847(6427,14), createAt:2019-08-22 15:09:26.564182608 +0800 CST m=+10585.392410772, startAt:2019-08-22 15:09:26.564231306 +0800 CST m=+10585.392459469, currentStep:0, steps:[transfer leader from s
tore 6 to store 801042]) timeout\""]
[2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454873] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454873(6432,14), createAt:2019-08-22 15:09:26.610276708 +0800 CST m=+10585.438504875, startAt:2019-08-22 15:09:26.610315702 +0800 CST m=+10585.438543867, currentStep:0, steps:[transfer leader from s
tore 6 to store 801042]) timeout\""]
[2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454899] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454899(6437,14), createAt:2019-08-22 15:09:26.672318565 +0800 CST m=+10585.500546736, startAt:2019-08-22 15:09:26.672372973 +0800 CST m=+10585.500601161, currentStep:0, steps:[transfer leader from s
tore 6 to store 801042]) timeout\""]
[2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454925] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454925(6442,14), createAt:2019-08-22 15:09:26.750802405 +0800 CST m=+10585.579030572, startAt:2019-08-22 15:09:26.750838817 +0800 CST m=+10585.579066987, currentStep:0, steps:[transfer leader from s
tore 6 to store 801042]) timeout\""]
[2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454951] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, 
region:7454951(6447,14), createAt:2019-08-22 15:09:26.830746915 +0800 CST m=+10585.658975086, startAt:2019-08-22 15:09:26.83079873 +0800 CST m=+10585.659026911, currentStep:0, steps:[transfer leader from st
ore 6 to store 801042]) timeout\""]
[2019/08/22 15:09:41.347 +08:00] [ERROR] [heartbeat_streams.go:96] ["send heartbeat message fail"] [region-id=7454862] [error="rpc error: code = Canceled desc = context canceled"]
[2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=4] [error="rpc error: code = Canceled desc = context canceled"]
[2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=801041] [error=EOF]
[2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=1] [error="rpc error: code = Canceled desc = context canceled"]
[2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=801042] [error="rpc error: code = Canceled desc = context canceled"]
[2019/08/22 15:13:06.452 +08:00] [INFO] [periodic.go:135] ["starting auto periodic compaction"] [revision=2487416] [compact-period=1h0m0s]
[2019/08/22 15:13:06.452 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416]
[2019/08/22 15:13:06.735 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=282.720864ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 15:13:06.735 +08:00] [INFO] [periodic.go:146] ["completed auto periodic compaction"] [revision=2487416] [compact-period=1h0m0s] [took=1h0m0.284263895s]
[2019/08/22 15:13:11.205 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.47066026s]
[2019/08/22 15:19:27.846 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454888] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 4 801042]} (kind:leader,region, region:7454888(6435,14), createAt:2019-08-22 15:09:26.650907908 +0800 CST m=+10585.479136083, startAt:2019-08-22 15:09:26.650959207 +0800 CST m=+10585.479187488, currentStep:0, steps:[transfer leader from store 6 to store 801042, add learner peer 7454892 on store 4, promote learner peer 7454892 on store 4 to voter, remove peer on store 6]) timeout\""]
[2019/08/22 15:19:27.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454877] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 4 801041]} (kind:leader,region, region:7454877(6433,14), createAt:2019-08-22 15:09:26.628893756 +0800 CST m=+10585.457122032, startAt:2019-08-22 15:09:26.628948518 +0800 CST m=+10585.457176693, currentStep:0, steps:[add learner peer 7454881 on store 801041, promote learner peer 7454881 on store 801041 to voter, add learner peer 7454882 on store 4, promote learner peer 7454882 on store 4 to voter, remove peer on store 801042, transfer leader from store 6 to store 801041, remove peer on store 6]) timeout\""]


follower 1
[2019/08/22 14:13:06.439 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.290444ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 14:13:10.886 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2486216] [took=4.447072711s]
[2019/08/22 15:13:06.453 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416]
[2019/08/22 15:13:06.739 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=286.841404ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 15:13:11.197 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.458105144s]
[2019/08/22 16:13:06.736 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2488882]
[2019/08/22 16:13:07.020 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=283.75354ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 16:13:11.469 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2488882] [took=4.44851634s]


follower 2
[2019/08/22 14:13:10.898 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2486216] [took=4.453153275s]
[2019/08/22 15:13:06.452 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416]
[2019/08/22 15:13:06.726 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.873836ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 15:13:11.179 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.453444168s]
[2019/08/22 16:13:06.735 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2488882]
[2019/08/22 16:13:07.013 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=278.001665ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] []
[2019/08/22 16:13:11.460 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2488882] [took=4.447399151s]



all tikv 
[2019/08/22 15:09:59.590 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=pd_follower_2:2379]
[2019/08/22 15:09:59.591 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2b7f11e482a0 for subchannel 0x2b7f1142ae00"]
[2019/08/22 15:09:59.591 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=pd_leader:2379]
[2019/08/22 15:09:59.592 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] [endpoints=pd_leader:2379]
[2019/08/22 15:09:59.592 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2019/08/22 15:09:59.999 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=409.085905ms]
[2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]
[2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]
[2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]
[2019/08/22 15:10:00.000 +08:00] [INFO] [client.rs:339] ["cancel region heartbeat sender"]
[2019/08/22 15:10:00.077 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"]

@wonderful1984
Copy link
Author

kill 都不行,要 -9

~]$ ps -ef|grep pd
apps 64274 1 21 Aug22 ? 03:35:35 /apps/svr/tidb3/bin/pd-server --name=

~]$ strace -p 64274
Process 64274 attached
futex(0x1f2fe88, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=90469, si_uid=500} ---
futex(0x1f4ee40, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn() = 202
futex(0x1f2fe88, FUTEX_WAIT_PRIVATE, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=90469, si_uid=500} ---
futex(0x1f4ee40, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn() = 202
futex(0x1f2fe88, FUTEX_WAIT_PRIVATE, 0, NULL

@wonderful1984
Copy link
Author

My solution:
1、kill -9 pd leader pid
2、Fresh elections
3、restart the pd

@disksing
Copy link
Contributor

Thanks @wonderful1984 , we will investigate the logs. It looks like a serious problem. By the way, if the problem shows again, could you help dump the goroutine to help us analysis? The command is:
curl "http://pd-addr:2379/debug/pprof/goroutine?debug=2"

@disksing
Copy link
Contributor

@wonderful1984 Could provide PD's version you are using?

@wonderful1984
Copy link
Author

wonderful1984 commented Aug 23, 2019

PD version
Release Version: v4.0.0-alpha-34-g2e401b60
Git Commit Hash: 2e401b6
Git Branch: master
UTC Build Time: 2019-08-21 06:35:07

TiDB version
Release Version: v4.0.0-alpha-109-g82ec3ca20
Git Commit Hash: 82ec3ca
Git Branch: master
UTC Build Time: 2019-08-22 01:11:56
GoVersion: go version go1.12 linux/amd64
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

TiKV version
Release Version: 4.0.0-alpha
Git Commit Hash: d9eeb314ea51819117f773b111e7d29b93435fd9
Git Commit Branch: master
UTC Build Time: 2019-08-21 12:33:47
Rust Version: rustc 1.38.0-nightly (311376d30 2019-07-18)

@disksing
Copy link
Contributor

Hi, @wonderful1984 , we have reproduced a similar issue in the recent master of PD tikv/pd#1707 . It is fixed in the latest master. Could you update PD and try again?

@wonderful1984
Copy link
Author

Is this deadlock only present in this latest release, or is it also present in 3.0.2 or 2.1.16?

@rleungx
Copy link
Member

rleungx commented Aug 29, 2019

@wonderful1984 This problem is only involved in the master branch, introduced by tikv/pd#1654 on Aug 8, 2019. And it has already been fixed by tikv/pd#1706 on Aug 26, 2019.

@wonderful1984
Copy link
Author

I've upgraded the system

@wonderful1984
Copy link
Author

pd_debug.log

好像有另外一个问题,看到使用lighning导入,又卡顿了

[apps@10.218.60.205(Unknow) ~]$ tail -100f /apps/svr/tidb2/bin/tidb-lightning.log|grep progress
[2019/08/29 20:20:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="135/9559 (1.4%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=66.16620179643499] [state=writing] [remaining=10h27m18s]
[2019/08/29 20:21:57.922 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=59.97592994809859] [state=writing] [remaining=10h13m57s]
[2019/08/29 20:22:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=54.51872688747679] [state=writing] [remaining=11h15m24s]
[2019/08/29 20:23:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=49.970267833060056] [state=writing] [remaining=12h16m53s]
[2019/08/29 20:24:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=46.1223130694202] [state=writing] [remaining=13h18m21s]
[2019/08/29 20:25:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=42.82460954942274] [state=writing] [remaining=14h19m50s]
[2019/08/29 20:26:57.906 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=39.96673540175886] [state=writing] [remaining=15h21m19s]
[2019/08/29 20:27:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=37.466909963212196] [state=writing] [remaining=16h22m47s]
[2019/08/29 20:28:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=35.26118164132799] [state=writing] [remaining=17h24m16s]
[2019/08/29 20:29:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=33.30072391486426] [state=writing] [remaining=18h25m44s]
[2019/08/29 20:30:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=31.546777184268475] [state=writing] [remaining=19h27m13s]
[2019/08/29 20:31:57.917 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=29.96792018350042] [state=writing] [remaining=20h28m43s]
[2019/08/29 20:32:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=28.540347202160625] [state=writing] [remaining=21h30m10s]
[2019/08/29 20:33:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=27.24224376893489] [state=writing] [remaining=22h31m39s]
[2019/08/29 20:34:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=26.057087014451653] [state=writing] [remaining=23h33m7s]
[2019/08/29 20:35:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=24.970749000988054] [state=writing] [remaining=24h34m36s]
============时间只增,不见有动静=======
[2019/08/29 21:12:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=9.821127794716194] [state=writing] [remaining=62h29m15s]

@wonderful1984 wonderful1984 reopened this Aug 29, 2019
@rleungx
Copy link
Member

rleungx commented Aug 30, 2019

Can you try curl http://pd-addr:2379/pd/api/v1/stores again?

@wonderful1984
Copy link
Author

curl http://pd-addr:2379/pd/api/v1/store 没有问题,我这边持续观察吧

[2019/08/29 21:38:50.860 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=66.554016828344] [state=writing] [remaining=9h13m16s]

[2019/08/29 22:55:50.861 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=7.985733109224674] [state=writing] [remaining=88h7m0s]
[2019/08/29 22:56:50.860 +08:00] [INFO] [restore.go:464] [progress] [files="177/9559 (1.9%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=8.516779609519327] [state=writing] [remaining=76h51m28s]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug This issue is a bug.
Projects
None yet
Development

No branches or pull requests

4 participants