lightning failed with '503 Service Unavailable' when inject pd leader network partition #49744

Closed
Lily2025 opened this issue Dec 25, 2023 · 3 comments · Fixed by #49860
Labels: component/lightning (This issue is related to Lightning of TiDB.), severity/major, type/bug (The issue is confirmed as a bug.)

Lily2025 commented Dec 25, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run Lightning.
2. Inject a network partition that isolates the PD leader from all other pods.

2. What did you expect to see? (Required)

Lightning should succeed when the PD leader is network-partitioned.

3. What did you see instead (Required)

Lightning failed after the PD leader network partition was injected.

Verbose debug logs will be written to /tmp/lightning.log.2023-12-23T02.30.44Z
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| # | CHECK ITEM | TYPE | PASSED |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 1 | Source data files size is proper | performance | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 2 | the checkpoints are valid | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 3 | table schemas are valid | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 4 | all importing tables on the target are empty | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 5 | Cluster version check passed | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 6 | Lightning has the correct storage permission | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 7 | local disk resources are rich, estimate sorted data size 26.05GiB, local available is 3.399TiB | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 8 | The storage space is rich, which TiKV/Tiflash is 5.368TiB/0B. The estimated storage space is 78.16GiB/0B. | performance | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 9 | Cluster doesn't have too many empty regions | performance | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 10 | Cluster region distribution is balanced | performance | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
| 11 | no CDC or PiTR task found | critical | true |
+----+-----------------------------------------------------------------------------------------------------------+-------------+--------+
{"level":"warn","ts":"2023-12-23T02:32:12.363771Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001f321c0/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-5340957-1-769.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2023-12-23T02:32:12.364291Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001ce6000/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2023-12-23T02:32:28.370028Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001f321c0/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-5340957-1-769.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2023-12-23T02:32:48.400951Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001ce6000/tc-pd:2379","attempt":0,"error":"rpc error: code = Unknown desc = context deadline exceeded"}
{"level":"warn","ts":"2023-12-23T02:35:54.402124Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001f321c0/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-5340957-1-769.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
tidb lightning encountered error: [Lightning:Restore:ErrRestoreTable]restore table sysbench.user_data1 failed: request pd http api failed with status: '503 Service Unavailable'

tidb logs:
tidb-0.log
tidb-1.log

request pd http api failed with status: '503 Service Unavailable'
github.com/tikv/pd/client/http.(*clientInner).doRequest
/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20231219031951-25f48f0bdd27/http/client.go:242
github.com/tikv/pd/client/http.(*clientInner).requestWithRetry
/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20231219031951-25f48f0bdd27/http/client.go:156
github.com/tikv/pd/client/http.(*client).request
/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20231219031951-25f48f0bdd27/http/client.go:414
github.com/tikv/pd/client/http.(*client).SetRegionLabelRule
/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20231219031951-25f48f0bdd27/http/interface.go:540
github.com/pingcap/tidb/br/pkg/pdutil.pauseSchedulerByKeyRangeWithTTL
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/pdutil/pd.go:1002
github.com/pingcap/tidb/br/pkg/pdutil.PauseSchedulersByKeyRange
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/pdutil/pd.go:970
github.com/pingcap/tidb/br/pkg/lightning/backend/local.(*Backend).ImportEngine
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/backend/local/local.go:1564
github.com/pingcap/tidb/br/pkg/lightning/backend.(*ClosedEngine).Import
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/backend/backend.go:366
github.com/pingcap/tidb/br/pkg/lightning/importer.(*TableImporter).importKV
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/table_import.go:1343
github.com/pingcap/tidb/br/pkg/lightning/importer.(*TableImporter).importEngine
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/table_import.go:919
github.com/pingcap/tidb/br/pkg/lightning/importer.(*TableImporter).importEngines.func3
/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/importer/table_import.go:525
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1650

4. What is your TiDB version? (Required)

./tidb-server -V
Release Version: v7.6.0-alpha
Edition: Community
Git Commit Hash: 5c279d8
Git Branch: heads/refs/tags/v7.6.0-alpha
UTC Build Time: 2023-12-21 07:56:10
GoVersion: go1.21.5
Race Enabled: false
Check Table Before Drop: false
Store: unistore
2023-12-23T04:16:24.782+0800

Lily2025 added the type/bug label on Dec 25, 2023

Lily2025 (author) commented

/type bug
/severity major
/assign lance6716

Lily2025 (author) commented

This needs a fix in PD: tikv/pd#7613

Lily2025 changed the title from "lightning failed when inject pd leader network partition" to "lightning failed with '503 Service Unavailable' when inject pd leader network partition" on Dec 26, 2023

Lily2025 (author) commented

Another case: add index failed with "Error 1105 (HY000): request pd http api failed with status: '503 Service Unavailable'" during a rolling restart of PD or after killing the PD leader.

Kill PD leader:
add index failed at 2023-12-26 02:13:57 (Error 1105 (HY000): request pd http api failed with status: '503 Service Unavailable')
operatorLogs:
[2023-12-26 02:11:06] ###### start adding index
alter table sbtest1 add index index_test_1703527866961 (c)
[2023-12-26 02:11:06] ###### wait for ddl job finish

PD rolling restart:
add index failed at 2023-12-26 08:28:01 (Error 1105 (HY000): request pd http api failed with status: '503 Service Unavailable')
operatorLogs:
[2023-12-26 08:26:55] ###### start adding index
alter table sbtest1 add index index_test_1703550415547 (c)
[2023-12-26 08:26:55] ###### wait for ddl job finish

tidb logs:
tidb-0.log
tidb-1.log
