Validators cannot discover P2P peers when running as StatefulSet in k8s #2378

Closed
mazzy89 opened this issue Jun 18, 2024 · 4 comments

Contributor

mazzy89 commented Jun 18, 2024

Description

In a multi-node scenario, when a validator is started with a list of nodes configured under p2p.persistent_peers and p2p.seeds, the DNS lookup of those peers fails. Similar products such as RabbitMQ, which try to reach other nodes/peers during their bootstrap phase, suffer from the same issue. See kubernetes/kubernetes#92559 (comment).

Your environment

  • k8s: v1.29.5-gke.1060000
  • version of gno: 0.1.0-4dafb8ae-nightly

Steps to reproduce

  • Run 3 validators in k8s as a StatefulSet, setting p2p.persistent_peers and p2p.seeds in config.toml to the Service addresses of the validators.
  • sw.Logger.Error("Error in peer's address", "err", err)
  • gnoland start

Expected behaviour

The DNS lookup should succeed and the node should connect to the other peers.

Actual behaviour

The DNS lookup fails. The node appears to try a second time, but that attempt also fails because the DNS record is not ready yet.

Logs

2024-06-18T09:29:40.184Z	INFO 	Starting multi	{"module": "proxy", "impl": "multi"}
2024-06-18T09:29:40.184Z	INFO 	Starting localClient	{"module": "proxy", "module": "abci-client", "connection": "query", "impl": "localClient"}
2024-06-18T09:29:40.184Z	INFO 	Starting localClient	{"module": "proxy", "module": "abci-client", "connection": "mempool", "impl": "localClient"}
2024-06-18T09:29:40.184Z	INFO 	Starting localClient	{"module": "proxy", "module": "abci-client", "connection": "consensus", "impl": "localClient"}
2024-06-18T09:29:40.184Z	INFO 	Starting EventStoreService	{"module": "eventstore", "impl": "EventStoreService"}
2024-06-18T09:29:40.184Z	INFO 	ABCI Handshake App Info	{"module": "consensus", "height": 0, "hash": "", "abci-version": "", "app-version": ""}
2024-06-18T09:29:40.184Z	INFO 	ABCI Replay Blocks	{"module": "consensus", "appHeight": 0, "storeHeight": 0, "stateHeight": 0}
2024-06-18T09:29:40.187Z	INFO 	Completed ABCI Handshake - Tendermint and App are synced	{"module": "consensus", "appHeight": 0, "appHash": ""}
2024-06-18T09:29:40.187Z	INFO 	Version info	{"version": "v1.0.0-rc.0"}
2024-06-18T09:29:40.187Z	INFO 	This node is a validator	{"module": "consensus", "addr": "g1e5cn4p8z7jhdylh98jmj8ugw2532lqx8e9kmw5", "pubKey": "gpub1pggj7ard9eg82cjtv4u52epjx56nzwgjyg9zpqh25w6ev6ww6lq70elf7ylvde3zqp06dlhhw7tj0cs4j3hpt3v5mfzgq0"}
2024-06-18T09:29:40.188Z	INFO 	P2P Node ID	{"module": "p2p", "ID": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl", "file": "/gnoland-data/secrets/node_key.json"}
2024-06-18T09:29:40.188Z	INFO 	Adding persistent peers	{"module": "p2p", "addrs": ["g1x6uuzyz0t50647wt8nduyxrlyduhj0yruk6vmr@devx-gnoland-val1-0:26657", "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@devx-gnoland-val2-0:26657", "g1vpmsut2s6z89rfyqzh5234xvcs5h2rtl238x8x@devx-gnoland-val3-0:26657"]}
2024-06-18T09:29:40.281Z	ERROR	Error in peer's address	{"module": "p2p", "err": "error looking up host (devx-gnoland-val1-0): lookup devx-gnoland-val1-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.282Z	ERROR	Error in peer's address	{"module": "p2p", "err": "error looking up host (devx-gnoland-val3-0): lookup devx-gnoland-val3-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.282Z	INFO 	Starting Node	{"impl": "Node"}
2024-06-18T09:29:40.282Z	INFO 	Starting P2P Switch	{"module": "p2p", "impl": "P2P Switch"}
2024-06-18T09:29:40.282Z	INFO 	Starting Reactor	{"module": "mempool", "impl": "Reactor"}
2024-06-18T09:29:40.282Z	INFO 	Starting BlockchainReactor	{"module": "blockchain", "impl": "BlockchainReactor"}
2024-06-18T09:29:40.282Z	INFO 	Starting BlockPool	{"module": "blockchain", "impl": "BlockPool"}
2024-06-18T09:29:40.282Z	INFO 	Starting ConsensusReactor	{"module": "consensus", "impl": "ConsensusReactor"}
2024-06-18T09:29:40.282Z	INFO 	ConsensusReactor 	{"module": "consensus", "fastSync": true}
2024-06-18T09:29:40.283Z	INFO 	Starting RPC HTTP server on [::]:26657	{"module": "rpc-server"}
2024-06-18T09:29:40.358Z	ERROR	Error in peer's address	{"module": "p2p", "err": "error looking up host (devx-gnoland-val1-0): lookup devx-gnoland-val1-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.358Z	ERROR	Error in peer's address	{"module": "p2p", "err": "error looking up host (devx-gnoland-val3-0): lookup devx-gnoland-val3-0 on 10.24.0.10:53: no such host"}
2024-06-18T09:29:40.359Z	DEBUG	Ignore attempt to connect to ourselves	{"module": "p2p", "addr": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@10.20.0.111:26657", "ourAddr": "g1k8telcwr2k88uw6zp57tqxcmujvqf2elxthgdl@0.0.0.0:26656"}
2024-06-18T09:29:41.314Z	DEBUG	Consensus ticker	{"module": "blockchain", "numPending": 0, "total": 0, "outbound": 0, "inbound": 0}

Proposed solution

The issue should be fixed by retrying the DNS lookup of the P2P peers multiple times. In a k8s environment, where there are many moving parts, retry with backoff is crucial to increase the chance of a successful connection.

Contributor Author

mazzy89 commented Jun 18, 2024

The code

if _, ok := err.(NetAddressLookupError); ok {
suggests that DNS lookup errors are ignored and the affected peers are simply skipped. However, the end result is that

2024-06-18T09:29:44.285Z	DEBUG	Blockpool has no peers	{"module": "blockchain"}

no peers are added.

Contributor Author

mazzy89 commented Jun 18, 2024

A workaround adopted by many similar upstream services that rely on a bootstrap phase to discover other peers is to set publishNotReadyAddresses: true on the Service. This solves the problem.

@mazzy89 mazzy89 closed this as completed Jun 18, 2024
@mazzy89 mazzy89 reopened this Jun 18, 2024
Contributor Author

mazzy89 commented Jun 18, 2024

Reopening the issue. It seems that even setting publishNotReadyAddresses: true on the Service does not help. Some nodes come up properly, while others fail. The overall bootstrap mechanism is not deterministic. I wonder whether a retry in the DNS lookup would help.

Contributor Author

mazzy89 commented Jun 19, 2024

Gave it another try, and it seems that after a few seconds the node retries connecting to the peers, which at that point have DNS records available, and the lookup succeeds. We can close this.

@mazzy89 mazzy89 closed this as completed Jun 19, 2024