Add timeout to server_join #10986

Open
mr-karan opened this issue Aug 2, 2021 · 2 comments

Comments

@mr-karan
Contributor

mr-karan commented Aug 2, 2021

Proposal

It would be nice if the server_join stanza could have a timeout field.

Use-cases

I purposefully gave a wrong config to a Nomad agent running as a server. From the logs:

Aug 02 12:38:23 nomad-node-0 nomad[50839]: ==> Newer Nomad version available: 1.1.3 (currently running: 1.1.2)
Aug 02 12:38:29 nomad-node-0 nomad[50839]:     2021-08-02T12:38:29.914+0530 [INFO]  client: node registration complete
Aug 02 12:39:20 nomad-node-0 nomad[50839]:     2021-08-02T12:39:20.239+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: " retry=15s


Aug 02 12:40:35 nomad-node-0 nomad[50839]:     2021-08-02T12:40:35.243+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:41:50 nomad-node-0 nomad[50839]:     2021-08-02T12:41:50.248+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.252+0530 [ERROR] agent.joiner: max join retry exhausted, exiting
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  agent: requesting shutdown
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client: shutting down
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.261+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [INFO]  nomad: shutting down server
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [WARN]  nomad: serf: Shutdown without a Leave
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  nomad: cluster leadership lost
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  agent: shutdown complete

You can see that Nomad took almost 5 minutes to detect that the server was unable to join, after which the service exited.

Since there's no timeout defined, I am guessing each join attempt waits for a default of 60s or something higher. There's no way to configure that, which also makes retry_interval largely useless: the next retry starts only after the previous attempt has failed, so attempts end up 75s apart according to the logs I shared (roughly 60s waiting on the dial plus the 15s retry_interval).
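
The timing in the logs can be reproduced with a little arithmetic. The sketch below is an assumption inferred from the timestamps, not taken from Nomad's source: it assumes each join attempt blocks for about 60s, that retry_max = 3 allows one initial attempt plus three retries, and that retry_interval is slept between attempts. The function name `time_until_exit` is hypothetical.

```python
def time_until_exit(retry_max: int, retry_interval_s: float, attempt_duration_s: float) -> float:
    """Estimate seconds from the first join attempt to the agent giving up.

    Assumes (inferred from the logs, not from Nomad's source) one initial
    attempt plus retry_max retries, with retry_interval slept between attempts.
    """
    attempts = retry_max + 1
    sleeps = retry_max
    return attempts * attempt_duration_s + sleeps * retry_interval_s


# Values from the config and logs in this issue: ~60s per attempt, 15s interval.
observed = time_until_exit(retry_max=3, retry_interval_s=15, attempt_duration_s=60)
print(observed)  # 285 seconds, i.e. 4m45s
```

Under the same assumptions, a 5s per-attempt timeout would bring the total down to `time_until_exit(3, 15, 5)`, which is 65 seconds.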

So maybe we can add a timeout field and give it a sane default, such as 5s (it should be less than retry_interval).
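
To make the proposal concrete, here is a minimal sketch of a retry loop where every dial is bounded by a per-attempt timeout. It is illustrative only and is not how Nomad or Serf implement joining: `retry_join`, the `dial` callback, and the `timeout_s` parameter are all hypothetical names, and treating `retry_max = 0` as "retry forever" is my assumption about the intended semantics.

```python
import time
from typing import Callable, Sequence


def retry_join(
    addrs: Sequence[str],
    dial: Callable[[str, float], None],
    timeout_s: float = 5.0,          # hypothetical per-attempt timeout from the proposal
    retry_interval_s: float = 15.0,
    retry_max: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to join any address, bounding each dial by timeout_s.

    dial(addr, timeout_s) must raise OSError on failure. Returns True on the
    first successful dial, False once the retries are exhausted.
    """
    attempt = 0
    while True:
        for addr in addrs:
            try:
                dial(addr, timeout_s)
                return True
            except OSError:
                continue  # this address failed, try the next one
        attempt += 1
        # Assumption: retry_max == 0 means retry indefinitely.
        if retry_max > 0 and attempt > retry_max:
            return False
        sleep(retry_interval_s)
```

With real sockets, `dial` could be something like `lambda addr, t: socket.create_connection((addr, 4648), timeout=t).close()`, so an unreachable address fails after `timeout_s` instead of whatever the underlying default is.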

@jrasell
Member

jrasell commented Aug 2, 2021

Hi @mr-karan; is it possible to share the example configuration you are using?

The retry join functionality uses the Serf library, which is currently configured with its default options. To allow modifying the join timeout, Nomad would need to expose Serf configuration options to the operator. This is possible, but would need investigation to understand any knock-on effects of certain parameters.

@mr-karan
Contributor Author

mr-karan commented Aug 2, 2021

Sorry, I missed adding it to the original post. Here you go (it's copied as-is from here):

server_join {
  retry_join = [ "1.1.1.1", "2.2.2.2" ]
  retry_max = 3
  retry_interval = "15s"
}

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

2 participants