Add timeout to server_join #10986

mr-karan · 2021-08-02T07:17:09Z

Proposal

It would be nice if the server_join stanza had a timeout field.

Use-cases

I purposefully gave a wrong config to a Nomad agent running as a server. From the logs:

Aug 02 12:38:23 nomad-node-0 nomad[50839]: ==> Newer Nomad version available: 1.1.3 (currently running: 1.1.2)
Aug 02 12:38:29 nomad-node-0 nomad[50839]:     2021-08-02T12:38:29.914+0530 [INFO]  client: node registration complete
Aug 02 12:39:20 nomad-node-0 nomad[50839]:     2021-08-02T12:39:20.239+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: " retry=15s


Aug 02 12:40:35 nomad-node-0 nomad[50839]:     2021-08-02T12:40:35.243+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:41:50 nomad-node-0 nomad[50839]:     2021-08-02T12:41:50.248+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.252+0530 [ERROR] agent.joiner: max join retry exhausted, exiting
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  agent: requesting shutdown
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client: shutting down
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.261+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [INFO]  nomad: shutting down server
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [WARN]  nomad: serf: Shutdown without a Leave
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  nomad: cluster leadership lost
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  agent: shutdown complete

You can see that Nomad took almost 5 minutes to determine that the server was unable to join, after which the service exited.

Since there's no timeout defined, I am guessing each join attempt waits for a default of 60s or something higher. There's no way to configure that, which also makes retry_interval ineffective, since the next retry only starts once the previous attempt has failed (the failures are 75s apart according to the logs I shared).

So maybe we can add a timeout with a sane default such as 5s (it should be less than retry_interval).
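As a rough sketch of the proposal, the stanza could look something like this. The `timeout` field name and its `"5s"` value are assumptions for illustration only; this is the option being requested, not an existing `server_join` setting.

```hcl
server_join {
  retry_join     = [ "1.1.1.1", "2.2.2.2" ]
  retry_max      = 3
  retry_interval = "15s"

  # Hypothetical field proposed in this issue (not an existing option):
  # give up on a single join attempt after this long, so that
  # retry_interval governs how soon the next attempt starts.
  timeout = "5s"
}
```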


jrasell · 2021-08-02T08:40:57Z

Hi @mr-karan; is it possible to share the example configuration you are using?

The retry join functionality uses the Serf library, which is currently configured with its default set of configuration options. To allow modifying the timeout on join, Nomad would need to expose Serf configuration options to the operator. This is possible, but would need investigation to understand any knock-on effects of certain parameters.

mr-karan · 2021-08-02T09:01:54Z

Sorry, I missed adding it in the original post. Here you go (it's copied as-is from here):

server_join {
  retry_join = [ "1.1.1.1", "2.2.2.2" ]
  retry_max = 3
  retry_interval = "15s"
}

mr-karan added the type/enhancement label Aug 2, 2021

jrasell added stage/needs-discussion theme/core labels Aug 2, 2021
