Proposal
It would be nice if the server_join stanza had a timeout field.
Use-cases
I purposefully gave a wrong config to a Nomad agent running as a server. From the logs:
Aug 02 12:38:23 nomad-node-0 nomad[50839]: ==> Newer Nomad version available: 1.1.3 (currently running: 1.1.2)
Aug 02 12:38:29 nomad-node-0 nomad[50839]: 2021-08-02T12:38:29.914+0530 [INFO] client: node registration complete
Aug 02 12:39:20 nomad-node-0 nomad[50839]: 2021-08-02T12:39:20.239+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:39:20 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:40:35 nomad-node-0 nomad[50839]: 2021-08-02T12:40:35.243+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:40:35 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:41:50 nomad-node-0 nomad[50839]: 2021-08-02T12:41:50.248+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:41:50 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.252+0530 [ERROR] agent.joiner: max join retry exhausted, exiting
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] agent: requesting shutdown
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] client: shutting down
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.256+0530 [INFO] client.plugin: plugin manager finished: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.256+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.259+0530 [INFO] client.plugin: plugin manager finished: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.259+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.261+0530 [INFO] client.plugin: plugin manager finished: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.262+0530 [INFO] nomad: shutting down server
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.262+0530 [WARN] nomad: serf: Shutdown without a Leave
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.263+0530 [INFO] nomad: cluster leadership lost
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.263+0530 [INFO] agent: shutdown complete
You can see that Nomad took almost 5 minutes (12:38:23 to 12:43:05) to determine that the server was unable to join, after which the service exited.
Since there's no timeout defined, I am guessing each join attempt waits for a default of 60s or something higher. There's no way to configure that, which makes retry_interval nearly useless as well, since the next retry only starts once the previous attempt has failed (consecutive failures are 75s apart in the logs I shared, even though retry=15s).
So maybe we can add a timeout option with a sane default like 5s (it should be less than retry_interval).
Hi @mr-karan; is it possible to share the example configuration you are using?
The retry join functionality uses the Serf library, which is currently configured with its default set of options. To allow modifying the join timeout, Nomad would need to expose Serf configuration options to the operator. This is possible, but it would need investigation to understand any knock-on effects of certain parameters.