This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Nomad: waypoint-static-runner - error connecting to server context deadline exceeded #4550

Open
chrisvanmeer opened this issue Feb 23, 2023 · 5 comments
Labels
bug Something isn't working core/install jira Will add an Issue to Jira plugin/nomad

Comments

@chrisvanmeer

chrisvanmeer commented Feb 23, 2023

Describe the bug
My first attempt at running Waypoint on Nomad failed, with the waypoint-server job reporting the error in the title. I was pointed to release 0.11.0, which contains a fix for that behaviour. The same problem now occurs in the waypoint-static-runner job.

Steps to Reproduce
I have since upgraded to the suggested version and retried the install on a new Nomad cluster (ACLs bootstrapped, mTLS and gossip encryption enabled, Consul integration), following the same tutorial. The waypoint-server job now becomes healthy, but the waypoint-static-runner job returns the same error.

stdout from the waypoint install command

[chris@desktop]$ waypoint install -plain -platform=nomad -accept-tos -nomad-dc=nl -nomad-host-volume=wp-server-vol -nomad-runner-host-volume=wp-runner-vol -nomad-consul-datacenter=nl -nomad-consul-domain=vanmeer.eu
-> Initializing Nomad client...
-> Checking for existing Waypoint server...
-> Installing Waypoint server to Nomad
-> Waiting for allocation to be scheduled
-> Nomad allocation pending...
-> Nomad allocation created
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Waiting for allocation "f823952b-e8dd-ebaf-9d32-036406e31343" to start
-> Nomad allocation running
-> Ensuring allocation "f823952b-e8dd-ebaf-9d32-036406e31343" has properly started up...
-> Nomad allocation running
-> Ensuring allocation "f823952b-e8dd-ebaf-9d32-036406e31343" has properly started up...
-> Nomad allocation running
-> Waypoint server ready
The CLI has been configured to automatically install a Consul service for
the Waypoint service backend and ui service in Nomad.
-> Connecting to: waypoint-server.service.nl.vanmeer.eu:9701
-> Attempting to make connection to server...
-> Successfully connected to Waypoint server in Nomad!
-> Configured server connection
-> 
-> Retrieving initial auth token...
-> Configuring server...
-> Server installed and configured!
-> 
-> Retrieving new auth token for runner...
-> Installing runner...
-> Initializing Nomad client...
-> Installing the Waypoint runner
-> Waiting for allocation to be scheduled
-> Nomad allocation pending...
-> Nomad allocation created
-> Waiting for allocation "1911ad82-303d-923c-b369-9199141b4d1d" to start
-> Waiting for allocation "1911ad82-303d-923c-b369-9199141b4d1d" to start
-> Waiting for allocation "1911ad82-303d-923c-b369-9199141b4d1d" to start
-> Waiting for allocation "1911ad82-303d-923c-b369-9199141b4d1d" to start
-> Nomad allocation running
-> Ensuring allocation "1911ad82-303d-923c-b369-9199141b4d1d" has properly started up...
-> Nomad allocation running
-> Ensuring allocation "1911ad82-303d-923c-b369-9199141b4d1d" has properly started up...
-> Nomad allocation running
-> Waypoint runner installed
-> Runner "static" installed
-> Waiting for runner to connect to server at waypoint-server.service.nl.vanmeer.eu:9701...
! Error adopting runner: runner not detected by server after 5 minutes

stderr logs of the waypoint-static-runner allocation

[root@nomad-client logs]# cat runner.stderr.0 
2023-02-23T09:04:10.233Z [INFO]  waypoint: waypoint version: full_string="v0.11.0 (e92d6fbe0+CHANGES)" version=v0.11.0 prerelease="" metadata="" revision="e92d6fbe0+CHANGES"
2023-02-23T09:04:10.234Z [DEBUG] waypoint: home configuration directory: path=/home/waypoint/.config/waypoint
2023-02-23T09:04:10.234Z [INFO]  waypoint.server: attempting to source credentials and connect
2023-02-23T09:04:10.238Z [DEBUG] waypoint.serverclient: connection information: address=waypoint-server.service.nl.vanmeer.eu:9701 tls=true tls_skip_verify=true send_auth=true has_token=true
2023-02-23T09:04:20.236Z [ERROR] waypoint: failed to create client: error="error connecting to server: context deadline exceeded"
2023-02-23T09:04:37.101Z [INFO]  waypoint: waypoint version: full_string="v0.11.0 (e92d6fbe0+CHANGES)" version=v0.11.0 prerelease="" metadata="" revision="e92d6fbe0+CHANGES"
2023-02-23T09:04:37.102Z [DEBUG] waypoint: home configuration directory: path=/home/waypoint/.config/waypoint
2023-02-23T09:04:37.102Z [INFO]  waypoint.server: attempting to source credentials and connect
2023-02-23T09:04:37.105Z [DEBUG] waypoint.serverclient: connection information: address=waypoint-server.service.nl.vanmeer.eu:9701 tls=true tls_skip_verify=true send_auth=true has_token=true
2023-02-23T09:04:47.103Z [ERROR] waypoint: failed to create client: error="error connecting to server: context deadline exceeded"
2023-02-23T09:05:04.444Z [INFO]  waypoint: waypoint version: full_string="v0.11.0 (e92d6fbe0+CHANGES)" version=v0.11.0 prerelease="" metadata="" revision="e92d6fbe0+CHANGES"
2023-02-23T09:05:04.446Z [DEBUG] waypoint: home configuration directory: path=/home/waypoint/.config/waypoint
2023-02-23T09:05:04.446Z [INFO]  waypoint.server: attempting to source credentials and connect
2023-02-23T09:05:04.454Z [DEBUG] waypoint.serverclient: connection information: address=waypoint-server.service.nl.vanmeer.eu:9701 tls=true tls_skip_verify=true send_auth=true has_token=true
2023-02-23T09:05:14.447Z [ERROR] waypoint: failed to create client: error="error connecting to server: context deadline exceeded"

However, the Waypoint server is reachable from both my desktop and the servers:

[chris@desktop]$ nc -z waypoint-server.service.nl.vanmeer.eu 9701; echo $?
0
[chris@desktop]$ curl -k https://waypoint-server.service.nl.vanmeer.eu:9701
curl: (1) Received HTTP/0.9 when not allowed

Expected behavior
The waypoint-static-runner job is healthy on Nomad and the installer completes without errors.

Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:

  • Waypoint CLI Version: 0.11.0
  • Waypoint Server Platform and Version: nomad
  • Waypoint Plugin: N/A
@briancain briancain added bug Something isn't working plugin/nomad core/install jira Will add an Issue to Jira and removed new labels Mar 1, 2023
@briancain
Member

Hey there @chrisvanmeer - the link you added for the fix doesn't seem to resolve to anything. What URL did you intend to share?

Can you confirm that when your runner allocation starts, it's able to resolve the Waypoint server allocation on Nomad via Consul's DNS? It could be that the runner can't resolve the hostname, which is why it fails to connect and times out.

@chrisvanmeer
Author

Hey @briancain, sorry about that; I've updated the link. I intended to refer to the release notes for the fix: #4363.

Do you have any tips on troubleshooting that? The runner job doesn't log anything that indicates it cannot resolve the address, and the job restarts so quickly that I cannot exec into it to troubleshoot properly.

@briancain
Member

@chrisvanmeer - I believe this is a Nomad setting, but I would recommend configuring Nomad not to clean up allocations immediately on failure. That should keep them around long enough for you to dive in and troubleshoot what's going on. Hopefully that helps! If you can resolve the Nomad-hosted server address and get a response from outside the runner (e.g. with telnet or curl), then it's likely another issue with our runner install.

And I see! Thanks for updating the link. I don't think the PR you linked is causing the issue. From what I can tell in the logs, the runner should be connecting to the right port for gRPC: 2023-02-23T09:04:10.238Z [DEBUG] waypoint.serverclient: connection information: address=waypoint-server.service.nl.vanmeer.eu:9701.

@chrisvanmeer
Author

With a lot of patience and copy-paste commands at the ready, I managed to verify that the waypoint-static-runner indeed cannot resolve the server.

$ nomad alloc exec -task runner c90bf8d8 sh
/ $ ping waypoint-server.service.nl.vanmeer.eu
ping: bad address 'waypoint-server.service.nl.vanmeer.eu'
/ $ ping waypoint-server.service.vanmeer.eu
ping: bad address 'waypoint-server.service.vanmeer.eu'
/ $ %

Even when I manually specify a DNS server in the runner's job spec and restart it, it still does not resolve the server address, which is odd, since other jobs on the same Nomad cluster can resolve each other fine.
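
The runner log only reports "context deadline exceeded", which does not say whether name resolution or the TCP connection failed. A minimal sketch of a check that separates the two cases is shown here, assuming Python is available on a host or container attached to the same network as the runner (the runner image itself may not ship it). The function name `diagnose` and its `resolve`/`connect` parameters are hypothetical and only exist to make the check easy to exercise.

```python
import socket


def diagnose(host, port, timeout=5.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Classify reachability of host:port as 'dns-failure',
    'connect-failure', or 'ok'.

    `resolve` and `connect` default to the standard library calls and
    are parameters only so the logic can be tested without a network.
    """
    try:
        # Step 1: name resolution, the step that failed with
        # "bad address" in the ping output above.
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: try a plain TCP connection to each resolved address.
    for _family, _socktype, _proto, _canonname, sockaddr in infos:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue
        conn.close()
        return "ok"
    return "connect-failure"
```

Calling `diagnose("waypoint-server.service.nl.vanmeer.eu", 9701)` from the affected network would be expected to return `"dns-failure"` if the name cannot be resolved there, matching the `ping: bad address` result, and `"connect-failure"` if the name resolves but the port is unreachable.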

@jjchiw

jjchiw commented Sep 11, 2023

Hi!

I have the same error. I'm testing with a single node. waypoint-server.service.dc1.consul resolves correctly in the runner container, but it resolves to 127.0.0.1, which inside the container refers to the container itself, so the runner cannot find the server.

I tried to set an IP address like this:

-nomad-service-address=172.26.64.200 -nomad-network-mode=bridge

And when I do that the server doesn't start.

I also tried to set the waypoint service's address_mode to alloc, but there seems to be no option for that: https://developer.hashicorp.com/waypoint/commands/install#nomad-service-address

This is the Network Definition in the waypoint-server job

"Networks": [
        {
          "Mode": "bridge",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "Hostname": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": [
            {
              "Label": "ui",
              "Value": 9702,
              "To": 9702,
              "HostNetwork": "default"
            },
            {
              "Label": "server",
              "Value": 9701,
              "To": 9701,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],

And these are the Services:

"Services": [
        {
          "Name": "waypoint-ui",
          "TaskName": "",
          "PortLabel": "ui",
          "AddressMode": "auto",
          "Address": "172.26.64.200",
          "EnableTagOverride": false,
          "Tags": [
            "waypoint"
          ],
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null,
          "TaggedAddresses": null,
          "Namespace": "default",
          "OnUpdate": "require_healthy",
          "Provider": "consul"
        },
        {
          "Name": "waypoint-server",
          "TaskName": "",
          "PortLabel": "server",
          "AddressMode": "auto",
          "Address": "172.26.64.200",
          "EnableTagOverride": false,
          "Tags": [
            "waypoint"
          ],
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null,
          "TaggedAddresses": null,
          "Namespace": "default",
          "OnUpdate": "require_healthy",
          "Provider": "consul"
        }
      ],

I'm running this version

Welcome to Waypoint
Docs: https://waypointproject.io
Version: v0.11.4

Edit

I found that when I set -nomad-service-address=172.26.64.200, the address 172.26.64.200 is registered in Consul and also set in the service definition, but the container's IP address is different. So I checked which IP address the container was actually assigned, set the service address to that address (172.26.64.78), and now the server connection passes :)

Then I was back at the first error: the runner wasn't connecting :(

I saw that the IP address assigned to the client was not in the range of the nomad interface:

> ip addr | grep 172
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
inet 172.26.64.1/20 brd 172.26.79.255 scope global nomad

It was in the range of docker0, so I added this rule in ufw:

ufw allow from 172.17.0.1/16 to 172.17.0.1 port 53

And now it works.... 🎉
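
The range check described above can also be done mechanically. The following is a small sketch using Python's standard `ipaddress` module with the two interface addresses from the `ip addr` output. The helper name `bridge_for` is hypothetical, and `172.17.0.2` is a made-up address used only for illustration, since the client's actual address is not stated in this thread.

```python
import ipaddress

# Interface addresses copied from the `ip addr | grep 172` output above.
BRIDGES = {
    "docker0": ipaddress.ip_interface("172.17.0.1/16"),
    "nomad": ipaddress.ip_interface("172.26.64.1/20"),
}


def bridge_for(address):
    """Return the name of the bridge whose network contains `address`,
    or None if it belongs to neither."""
    ip = ipaddress.ip_address(address)
    for name, iface in BRIDGES.items():
        if ip in iface.network:
            return name
    return None


# Both addresses tried for -nomad-service-address are in the nomad range.
print(bridge_for("172.26.64.200"))
print(bridge_for("172.26.64.78"))
# Hypothetical docker0 address, standing in for the client's real one.
print(bridge_for("172.17.0.2"))
```

An address that falls under docker0 rather than nomad would explain why DNS traffic to 172.17.0.1 on port 53 needed its own ufw rule.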
