New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
After importing data using lightning, TiDB fails to connect PD #11767
Comments
kv log [2019/08/19 09:42:49.201 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."] [2019/08/19 09:42:49.930 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=730.436273ms] [2019/08/19 09:42:51.930 +08:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] [2019/08/19 09:42:51.930 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [2019/08/19 09:42:51.931 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2ba17150ef80 for subchannel 0x2b7646aa3a00"] [2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] [2019/08/19 09:42:51.932 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."] [2019/08/19 09:42:52.656 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=725.599177ms] [2019/08/19 09:42:54.656 +08:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] [2019/08/19 09:42:54.656 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [2019/08/19 09:42:54.657 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2c2ae8b541c0 for subchannel 0x2b7646ab8400"] [2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] [2019/08/19 09:42:54.658 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."] [2019/08/19 09:42:55.375 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=718.196067ms] [2019/08/19 09:42:56.957 +08:00] [INFO] [store.rs:1223] ["merged peer receives a stale message"] [msg_type=MsgRequestPreVote] [current_region_epoch="RegionEpoch { conf_ver: 15, version: 5528, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }"] [region_id=1061196] [2019/08/19 09:42:56.957 +08:00] [INFO] [store.rs:312] ["raft message is stale, tell to gc"] [msg_type=MsgRequestPreVote] [current_region_epoch="RegionEpoch { conf_ver: 15, version: 5528, unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }"] [region_id=1061196] |
Are you sure the |
PD 1 leader [2019/08/17 21:14:22.517 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=625.9951ms] [prev-physical=2019/08/17 21:14:21.891 +08:00] [now=2019/08/17 21:14:22.517 +08:00] [2019/08/17 21:14:45.244 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=9792] [operator="\"merge-region (kind:leader,region,merge, region:9792(96,8), createAt:2019-08-17 21:04:44.854500066 +0800 CST m=+168884.260018674, startAt:2019-08-17 21:04:44.85464217 +0800 CST m=+168884.260160779, currentStep:4, steps:[transfer leader from store 6 to store 1 add learner peer 7060715 on store 801042 promote learner peer 7060715 on store 801042 to voter remove peer on store 6 add learner peer 7060716 on store 801041 promote learner peer 7060716 on store 801041 to voter remove peer on store 4 merge region 9792 into region 9645]) timeout\""] [2019/08/17 21:14:45.244 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=9645] [operator="\"merge-region (kind:leader,region,merge, region:9645(88,14), createAt:2019-08-17 21:04:44.85450166 +0800 CST m=+168884.260020269, startAt:2019-08-17 21:04:44.854726147 +0800 CST m=+168884.260244871, currentStep:0, steps:[merge region 9792 into region 9645]) timeout\""] [2019/08/17 21:15:32.744 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=302681] [operator="\"merge-region (kind:region,merge, region:302681(1648,5), createAt:2019-08-17 21:05:28.791592918 +0800 CST m=+168928.197111524, startAt:2019-08-17 21:05:28.791802259 +0800 CST m=+168928.197320867, currentStep:0, steps:[merge region 302505 into region 302681]) timeout\""] [2019/08/17 21:15:32.745 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=302505] [operator="\"merge-region (kind:region,merge, region:302505(1645,11), createAt:2019-08-17 21:05:28.791591469 +0800 CST m=+168928.197110077, startAt:2019-08-17 21:05:28.791717791 +0800 CST m=+168928.197236399, currentStep:3, steps:[add learner peer 7060839 on store 4 promote learner peer 7060839 on store 4 to voter remove peer on store 801041 add learner peer 7060840 on store 6 promote learner peer 7060840 on store 6 to voter remove peer on store 801042 merge region 302505 into region 302681]) timeout\""] [2019/08/17 22:29:22.757 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=815.313976ms] [prev-physical=2019/08/17 22:29:21.941 +08:00] [now=2019/08/17 22:29:22.757 +08:00] [2019/08/17 22:34:22.550 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=608.640992ms] [prev-physical=2019/08/17 22:34:21.941 +08:00] [now=2019/08/17 22:34:22.550 +08:00] [2019/08/17 22:39:22.483 +08:00] [WARN] [tso.go:162] ["clock offset"] [jet-lag=541.727388ms] [prev-physical=2019/08/17 22:39:21.941 +08:00] [now=2019/08/17 22:39:22.483 +08:00] [2019/08/17 23:49:08.189 +08:00] [WARN] [raft.go:358] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=603.176518ms] [2019/08/19 00:09:52.834 +08:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp pd_leader:2379->lightning:41386: read: connection reset by peer"] PD 2 follower [2019/08/17 20:10:20.281 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2348794] [2019/08/17 20:10:20.614 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=309.202189ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 20:10:25.079 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2348794] [took=4.488464414s] [2019/08/17 21:10:20.583 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2350038] [2019/08/17 21:10:20.978 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=354.571268ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 21:10:25.426 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2350038] [took=4.488168608s] [2019/08/17 22:10:20.931 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2351316] [2019/08/17 22:10:21.229 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=298.371665ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 22:10:25.686 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2351316] [took=4.457144061s] [2019/08/17 23:10:21.205 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2352518] [2019/08/17 23:10:21.492 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=286.663175ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] PD 3 follower [2019/08/17 19:10:24.748 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2347584] [took=4.485168367s] [2019/08/17 20:10:20.279 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2348794] [2019/08/17 20:10:20.583 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=303.897678ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 20:10:25.083 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2348794] [took=4.500436506s] [2019/08/17 21:10:20.582 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2350038] [2019/08/17 21:10:20.897 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=315.137035ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 21:10:25.402 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2350038] [took=4.504897045s] [2019/08/17 22:10:20.929 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2351316] [2019/08/17 22:10:21.225 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=295.918878ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 22:10:25.683 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2351316] [took=4.458739444s] [2019/08/17 23:10:21.204 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2352518] [2019/08/17 23:10:21.477 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.698022ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/17 23:10:25.939 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2352518] [took=4.461341314s] |
事件:导入历史数据 # 过程如下 2019/08/17 09:56:33.317 lightning导入A库 157G 2019/08/17 12:05:59.939 导完,耗时2小时 [info] restore all tables data takes 2h8m33.534074942s 2019/08/17 19:13:05.511 lightning导入B库 5.2T 2019/08/17 19:13:05.956 导入失败 [error] B库.B_table-schema.sql as auto: invalid schema encoding 2019/08/17 19:13:14.921 [error] [step 2] run cause error : table info B_table not found # B库编码问题,先skip,后续再处理 2019/08/17 20:28:35.769 lightning导入C库 288G # 处理了 30分钟后 2019/08/17 21:06:11.158 [info] [`C库`.`C_table`:1] flush kv deliver ... 2019/08/17 21:06:11.158 [info] [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import 2019/08/17 21:06:45.276 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 129.37 MiB/s, post-processing 2019/08/17 21:09:05.420 [warning] cannot switch to Import mode: rpc error: code = Unknown desc = PdRPC(Other(StringError("[src/pd/util.rs:344]: fail to request"))) 2019/08/17 21:09:05.432 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 121.87 MiB/s, post-processing 2019/08/17 21:09:45.276 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 119.90 MiB/s, post-processing 2019/08/17 21:10:45.276 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 117.04 MiB/s, post-processing 2019/08/17 21:11:45.276 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 114.32 MiB/s, post-processing 2019/08/17 21:12:45.276 [info] progress: 1238/1238 chunks (100.0%), 7/8 tables (87.5%), speed 111.72 MiB/s, post-processing 2019/08/17 21:14:05.308 [warning] cannot switch to Import mode: rpc error: code = Unknown desc = PdRPC(Other(StringError("[src/pd/util.rs:344]: fail to request"))) # 一直卡了29个小时 2019/08/19 01:30:46.760 [error] [`C库`.`C_table`] error [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed("retry 4 times still 312 ranges failed") 2019/08/19 01:30:46.760 [info] restore all tables data takes 29h2m1.484031341s 2019/08/19 01:30:46.760 [error] [step 2] run cause error : [`C库`.`C_table`:1] [14b0834e-71da-52d8-8213-27434a0fd2e8] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed("retry 4 times still 312 ranges failed") |
The PD leader seems not to function well. |
Seems tikv-server cannot connect to pd-server, but pd-server's log looks fine. Could you try to access PD's api from tikv-server host to check if there are some problem with network? For example, you can |
no response [tidb ~]$ timeout 30 curl http://pd_leader:2379/pd/api/v1/stores [tidb ~]$ |
strace futex(0xc00026f648, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99680522}) = -1 ETIMEDOUT (Connection timed out) futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc000526f48, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99668602}) = -1 ETIMEDOUT (Connection timed out) futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc00037d9c8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99646967}) = -1 ETIMEDOUT (Connection timed out) futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc0001a0bc8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 99634101}) = -1 ETIMEDOUT (Connection timed out) futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc01a28e148, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc0001a0bc8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f33420, FUTEX_WAIT_PRIVATE, 0, {0, 79819665}) = -1 ETIMEDOUT (Connection timed out) futex(0x1f2f3d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x1f2f2d0, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc0000aa4c8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 epoll_pwait(4, {}, 128, 0, NULL) = 0 nanosleep({0, 3000}, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 epoll_pwait(4, {}, 128, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 epoll_pwait(4, {}, 128, 0, NULL) = 0 futex(0xc0000ab2c8, FUTEX_WAKE_PRIVATE, 1) = 1 write(45, "\0\0\4\10\0\0\0\0\0\0\0\0\5\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7", 30) = 30 nanosleep({0, 3000}, NULL) = 0 futex(0xc0263fc148, FUTEX_WAKE_PRIVATE, 1) = 1 write(25, "\0\0\4\10\0\0\0\0\0\0\0\0\5\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7\0\0"..., 343) = 343 epoll_pwait(4, {}, 128, 0, NULL) = 0 write(24, "\0\0\10\6\1\0\0\0\0\2\4\20\20\t\16\7\7\0\0\4\10\0\0\0\0\0\0\0\1\32\0\0"..., 47) = 47 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 epoll_pwait(4, {}, 128, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc0263fc148, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=3301687960, u64=47960906515096}}}, 128, 0, NULL) = 1 epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=3301687960, u64=47960906515096}}}, 128, -1, NULL) = 1 epoll_pwait(4, {{EPOLLIN, {u32=1136229736, u64=47950151122280}}}, 128, -1, NULL) = 1 accept4(7, {sa_family=AF_INET, sin_port=htons(64970), sin_addr=inet_addr("tidb-lightning")}, [16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 54 epoll_ctl(4, EPOLL_CTL_ADD, 54, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3301688792, u64=47960906515928}}) = 0 getsockname(54, {sa_family=AF_INET, sin_port=htons(2379), sin_addr=inet_addr("pd_leader")}, [16]) = 0 setsockopt(54, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(54, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 setsockopt(54, SOL_TCP, TCP_KEEPINTVL, [30], 4) = 0 setsockopt(54, SOL_TCP, TCP_KEEPIDLE, [30], 4) = 0 accept4(7, 0xc00a6217c8, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) epoll_pwait(4, {{EPOLLIN|EPOLLOUT, {u32=1136226824, u64=47950151119368}}}, 128, 0, NULL) = 1 read(25, "\0\0\4\10\0\0\0\0\0\0\0\3N\0\0\10\6\0\0\0\0\0\2\4\20\20\t\16\7\7", 32768) = 30 read(25, 0xc00d50a000, 32768) = -1 EAGAIN (Resource temporarily unavailable) epoll_pwait(4, {}, 128, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL) = 0 futex(0xc00037cbc8, FUTEX_WAIT_PRIVATE, 0, NULL^CProcess 64298 detached |
using tidb-lightning(v2.1.14 or v2.1.16),PD Always hang leader [2019/08/22 15:09:26.955 +08:00] [INFO] [id.go:110] ["idAllocator allocates a new id"] [alloc-id=7456000] [2019/08/22 15:09:37.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454821] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454821(6422,14), createAt:2019-08-22 15:09:26.485487154 +0800 CST m=+10585.313715320, startAt:2019-08-22 15:09:26.485538587 +0800 CST m=+10585.313766798, currentStep:0, steps:[transfer leader from s tore 6 to store 1]) timeout\""] [2019/08/22 15:09:37.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454847] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454847(6427,14), createAt:2019-08-22 15:09:26.564182608 +0800 CST m=+10585.392410772, startAt:2019-08-22 15:09:26.564231306 +0800 CST m=+10585.392459469, currentStep:0, steps:[transfer leader from s tore 6 to store 801042]) timeout\""] [2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454873] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454873(6432,14), createAt:2019-08-22 15:09:26.610276708 +0800 CST m=+10585.438504875, startAt:2019-08-22 15:09:26.610315702 +0800 CST m=+10585.438543867, currentStep:0, steps:[transfer leader from s tore 6 to store 801042]) timeout\""] [2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454899] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454899(6437,14), createAt:2019-08-22 15:09:26.672318565 +0800 CST m=+10585.500546736, startAt:2019-08-22 15:09:26.672372973 +0800 CST m=+10585.500601161, currentStep:0, steps:[transfer leader from s tore 6 to store 801042]) timeout\""] [2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454925] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454925(6442,14), createAt:2019-08-22 15:09:26.750802405 +0800 CST m=+10585.579030572, startAt:2019-08-22 15:09:26.750838817 +0800 CST m=+10585.579066987, currentStep:0, steps:[transfer leader from s tore 6 to store 801042]) timeout\""] [2019/08/22 15:09:38.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454951] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 6 801042]} (kind:leader, region:7454951(6447,14), createAt:2019-08-22 15:09:26.830746915 +0800 CST m=+10585.658975086, startAt:2019-08-22 15:09:26.83079873 +0800 CST m=+10585.659026911, currentStep:0, steps:[transfer leader from st ore 6 to store 801042]) timeout\""] [2019/08/22 15:09:41.347 +08:00] [ERROR] [heartbeat_streams.go:96] ["send heartbeat message fail"] [region-id=7454862] [error="rpc error: code = Canceled desc = context canceled"] [2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=4] [error="rpc error: code = Canceled desc = context canceled"] [2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=801041] [error=EOF] [2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=1] [error="rpc error: code = Canceled desc = context canceled"] [2019/08/22 15:10:10.590 +08:00] [ERROR] [heartbeat_streams.go:120] ["send keepalive message fail"] [target-store-id=801042] [error="rpc error: code = Canceled desc = context canceled"] [2019/08/22 15:13:06.452 +08:00] [INFO] [periodic.go:135] ["starting auto periodic compaction"] [revision=2487416] [compact-period=1h0m0s] [2019/08/22 15:13:06.452 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416] [2019/08/22 15:13:06.735 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=282.720864ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 15:13:06.735 +08:00] [INFO] [periodic.go:146] ["completed auto periodic compaction"] [revision=2487416] [compact-period=1h0m0s] [took=1h0m0.284263895s] [2019/08/22 15:13:11.205 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.47066026s] [2019/08/22 15:19:27.846 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454888] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 4 801042]} (kind:leader,region, region:7454888(6435,14), createAt:2019-08-22 15:09:26.650907908 +0800 CST m=+10585.479136083, startAt:2019-08-22 15:09:26.650959207 +0800 CST m=+10585.479187488, currentStep:0, steps:[transfer leader from store 6 to store 801042, add learner peer 7454892 on store 4, promote learner peer 7454892 on store 4 to voter, remove peer on store 6]) timeout\""] [2019/08/22 15:19:27.847 +08:00] [INFO] [operator_controller.go:108] ["operator timeout"] [region-id=7454877] [operator="\"scatter-region {scatter region: stores [1 6 801042] to [1 4 801041]} (kind:leader,region, region:7454877(6433,14), createAt:2019-08-22 15:09:26.628893756 +0800 CST m=+10585.457122032, startAt:2019-08-22 15:09:26.628948518 +0800 CST m=+10585.457176693, currentStep:0, steps:[add learner peer 7454881 on store 801041, promote learner peer 7454881 on store 801041 to voter, add learner peer 7454882 on store 4, promote learner peer 7454882 on store 4 to voter, remove peer on store 801042, transfer leader from store 6 to store 801041, remove peer on store 6]) timeout\""] follower 1 [2019/08/22 14:13:06.439 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.290444ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 14:13:10.886 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2486216] [took=4.447072711s] [2019/08/22 15:13:06.453 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416] [2019/08/22 15:13:06.739 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=286.841404ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 15:13:11.197 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.458105144s] [2019/08/22 16:13:06.736 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2488882] [2019/08/22 16:13:07.020 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=283.75354ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 16:13:11.469 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2488882] [took=4.44851634s] follower 2 [2019/08/22 14:13:10.898 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2486216] [took=4.453153275s] [2019/08/22 15:13:06.452 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2487416] [2019/08/22 15:13:06.726 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=273.873836ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 15:13:11.179 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2487416] [took=4.453444168s] [2019/08/22 16:13:06.735 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=2488882] [2019/08/22 16:13:07.013 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=278.001665ms] [expected-duration=100ms] [prefix=] [request="header: compaction: "] [response=size:7] [] [2019/08/22 16:13:11.460 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=2488882] [took=4.447399151s] all tikv [2019/08/22 15:09:59.590 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=pd_follower_2:2379] [2019/08/22 15:09:59.591 +08:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x2b7f11e482a0 for subchannel 0x2b7f1142ae00"] [2019/08/22 15:09:59.591 +08:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=pd_leader:2379] [2019/08/22 15:09:59.592 +08:00] [INFO] [util.rs:455] ["connected to PD leader"] [endpoints=pd_leader:2379] [2019/08/22 15:09:59.592 +08:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."] [2019/08/22 15:09:59.999 +08:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=409.085905ms] [2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] [2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] [2019/08/22 15:10:00.000 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] [2019/08/22 15:10:00.000 +08:00] [INFO] [client.rs:339] ["cancel region heartbeat sender"] [2019/08/22 15:10:00.077 +08:00] [ERROR] [util.rs:286] ["request failed, retry"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(4), details: Some(\"Deadline Exceeded\") }))"] |
kill 都不行,要 -9 ~]$ strace -p 64274 |
My solution: |
Thanks @wonderful1984 , we will investigate the logs. It looks like a serious problem. By the way, if the problem shows again, could you help dump the goroutine to help us analysis? The command is: |
@wonderful1984 Could provide PD's version you are using? |
PD version TiDB version TiKV version |
Hi, @wonderful1984 , we have reproduced a similar issue in the recent master of PD tikv/pd#1707 . It is fixed in the latest master. Could you update PD and try again? |
Is this deadlock only present in this latest release, or is it also present in 3.0.2 or 2.1.16? |
@wonderful1984 This problem is only involved in the master branch, introduced by tikv/pd#1654 on Aug 8, 2019. And it has already been fixed by tikv/pd#1706 on Aug 26, 2019. |
I've upgraded the system |
好像有另外一个问题,看到使用lighning导入,又卡顿了 [apps@10.218.60.205(Unknow) ~]$ tail -100f /apps/svr/tidb2/bin/tidb-lightning.log|grep progress [2019/08/29 20:20:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="135/9559 (1.4%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=66.16620179643499] [state=writing] [remaining=10h27m18s] [2019/08/29 20:21:57.922 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=59.97592994809859] [state=writing] [remaining=10h13m57s] [2019/08/29 20:22:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=54.51872688747679] [state=writing] [remaining=11h15m24s] [2019/08/29 20:23:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=49.970267833060056] [state=writing] [remaining=12h16m53s] [2019/08/29 20:24:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=46.1223130694202] [state=writing] [remaining=13h18m21s] [2019/08/29 20:25:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=42.82460954942274] [state=writing] [remaining=14h19m50s] [2019/08/29 20:26:57.906 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=39.96673540175886] [state=writing] [remaining=15h21m19s] [2019/08/29 20:27:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=37.466909963212196] [state=writing] [remaining=16h22m47s] [2019/08/29 20:28:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=35.26118164132799] [state=writing] [remaining=17h24m16s] [2019/08/29 20:29:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=33.30072391486426] [state=writing] [remaining=18h25m44s] [2019/08/29 20:30:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=31.546777184268475] [state=writing] [remaining=19h27m13s] [2019/08/29 20:31:57.917 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=29.96792018350042] [state=writing] [remaining=20h28m43s] [2019/08/29 20:32:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=28.540347202160625] [state=writing] [remaining=21h30m10s] [2019/08/29 20:33:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=27.24224376893489] [state=writing] [remaining=22h31m39s] [2019/08/29 20:34:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=26.057087014451653] [state=writing] [remaining=23h33m7s] [2019/08/29 20:35:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=24.970749000988054] [state=writing] [remaining=24h34m36s] ============时间只增,不见有动静======= [2019/08/29 21:12:57.900 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=9.821127794716194] [state=writing] [remaining=62h29m15s] |
Can you try |
curl http://pd-addr:2379/pd/api/v1/store 没有问题,我这边持续观察吧 [2019/08/29 21:38:50.860 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=66.554016828344] [state=writing] [remaining=9h13m16s] [2019/08/29 22:55:50.861 +08:00] [INFO] [restore.go:464] [progress] [files="153/9559 (1.6%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=7.985733109224674] [state=writing] [remaining=88h7m0s] [2019/08/29 22:56:50.860 +08:00] [INFO] [restore.go:464] [progress] [files="177/9559 (1.9%)"] [tables="2/14 (14.3%)"] [speed(MiB/s)=8.516779609519327] [state=writing] [remaining=76h51m28s] |
1、使用lightning导入数据后,tidb无法链接pd
["Release Version"=v4.0.0-alpha-74-gdb6c36c75] ["Git Commit Hash"=db6c36c75f64d2f6d92f6ef5b810b96ce6a42f09] ["Git Branch"=master]
2、重启tidb,报错无法启动
The text was updated successfully, but these errors were encountered: