This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

waypoint install on nomad service discovery problems #2314

Closed
izaaklauer opened this issue Sep 16, 2021 · 5 comments
Labels: bug (Something isn't working), plugin/nomad

Comments

@izaaklauer
Contributor

izaaklauer commented Sep 16, 2021

Describe the bug

Currently, when installing Waypoint on Nomad with waypoint server install -platform=nomad, the local Waypoint CLI schedules a Waypoint server Nomad job, reads the resulting allocation's ip:port, and uses that as the Waypoint server address for the new context. There are a few problems with this:

1: If Nomad ever schedules a new allocation for the job (for example, when a Nomad client is drained for maintenance), the new allocation will have a new IP, and all existing contexts will break.

2: It isn't possible to set the allocation IP; it's determined automatically from the IP of the host (at least with the network settings we've tried). This means that if a user is running Nomad on a typical EC2 instance with only a private IP configured on the network interface inside the VM, the allocation will always have a private IP and won't be reachable from the internet. This might be OK for users who extend their VPC to developer laptops with a VPN, but it isn't an assumption I think we should make.

Steps to Reproduce

We discovered this while testing with Nomad on EC2. To reproduce:

  • Create a new EC2 instance with a public and a private IP
  • Install Nomad and Docker
  • From your laptop, run NOMAD_ADDR=<ec2-ip>:4646 waypoint server install -platform=nomad -accept-tos
  • Observe that the command hangs while waiting for something like 172.31.1.9:9701 to become reachable. This is the private IP of the EC2 instance, so it will never be reachable from the laptop.
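
The hang in the last step could in principle be detected up front. The following is a minimal Go sketch, not Waypoint's actual install code: the helper names isPrivateAddr and reachable are hypothetical, and it assumes Go 1.17 or later for net.IP.IsPrivate. It flags an allocation address that is private and fails a single bounded connection attempt, instead of waiting indefinitely.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// isPrivateAddr reports whether the host part of an ip:port string is a
// private address (RFC 1918 for IPv4, RFC 4193 for IPv6).
// Hostnames and malformed input return false.
func isPrivateAddr(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsPrivate()
}

// reachable makes one TCP connection attempt bounded by timeout,
// so the caller never blocks longer than that.
func reachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Example address taken from the reproduction steps above.
	addr := "172.31.1.9:9701"
	if isPrivateAddr(addr) && !reachable(addr, 2*time.Second) {
		fmt.Printf("warning: %s is a private address and is not reachable from this machine\n", addr)
	}
}
```

A check like this only explains the failure; it does not make the server reachable.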

Expected behavior
Installing on Nomad should be possible on Nomad clusters running on EC2 without assuming a VPN, and should be resilient to allocation changes.

Options

  • One option may be to auto-detect whether Consul is available, and if so create a Consul service for our Waypoint server. I'm not sure what percentage of Nomad users also run Consul, but I expect it's quite high.

  • The Nomad team may introduce some lightweight service discovery that we could use, but it isn't present today.

Waypoint Platform Versions
Additional version and platform information to help triage the issue if applicable:

  • Waypoint CLI Version: 0.5.2
  • Waypoint Server Platform and Version: 0.5.2
  • Waypoint Plugin: nomad

izaaklauer added the bug (Something isn't working), plugin/nomad, and new labels on Sep 16, 2021

@evanphx
Contributor

evanphx commented Sep 23, 2021

This is sort of a Nomad installation issue. While we set up the context with the allocation IP, there aren't a lot of other options.

If someone is using Nomad in production, they'll have to set up the Waypoint context with a more stable identifier. Perhaps we just need to output a message about this during the waypoint install process.

@izaaklauer
Contributor Author

I'm a bit hesitant to ask users to manually modify their context file after a server install, and I also suspect that the invite flow pulls the address from the Waypoint server, which would propagate the unstable allocation IP.

Another quick-and-dirty option: now that we have #2328 merged, Nomad users could create a persistent address somehow (Consul, clever DNS, etc.) before the install, and then use waypoint server install -platform=nomad -accept-tos -- -advertise-addr=waypoint.mycorp.internal. We could even print a warning at the end of the Nomad install if they didn't set this flag, something like:

Nomad server installation is complete, but is using an ephemeral allocation IP address (10.0.1.90:9701). This is subject to change, and when it does it will break all Waypoint contexts. To avoid this, create a central service address for the Waypoint server (e.g. a Consul service), uninstall this Waypoint server, and re-install it with the -- -advertise-addr=<service-address>:9701 flag.

I don't really like that idea either.
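
For reference, the gating logic for the warning proposed above is small. This Go sketch is purely illustrative: installWarning is a hypothetical helper, and how the installer would actually learn whether -advertise-addr was passed is an assumption. It emits the message only when no advertise address was supplied.

```go
package main

import "fmt"

// installWarning returns the proposed warning when the install fell back
// to the allocation address, and an empty string when the user supplied
// an advertise address. Both parameters are hypothetical inputs.
func installWarning(allocAddr, advertiseAddr string) string {
	if advertiseAddr != "" {
		return ""
	}
	return fmt.Sprintf(
		"Nomad server installation is complete, but is using an ephemeral "+
			"allocation IP address (%s). This is subject to change, and when it "+
			"does it will break all Waypoint contexts.", allocAddr)
}

func main() {
	// No advertise address given: the warning is printed.
	if msg := installWarning("10.0.1.90:9701", ""); msg != "" {
		fmt.Println(msg)
	}
	// Advertise address given: nothing is printed.
	if msg := installWarning("10.0.1.90:9701", "waypoint.mycorp.internal"); msg != "" {
		fmt.Println(msg)
	}
}
```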

@izaaklauer
Contributor Author

Neat idea from @krantzinator: it might be possible to do a server upgrade after the initial install, and specify the new advertise address there. That would work even if it's impossible to know the name of the Consul service ahead of time, before it is created. Any of these would require some experimentation.

@evanphx
Contributor

evanphx commented Oct 6, 2021

A simple "fix" is to emit a warning on waypoint install that the context is using the allocation address.

@briancain
Member

This was fixed by introducing Consul DNS to the waypoint install command for the Nomad platform. I'm going to go ahead and close this since I think it's resolved!
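
To illustrate why a Consul DNS name avoids the original problem, here is a Go sketch that builds a stable server address from Consul's standard DNS form, <service>.service[.<datacenter>].<domain>. The service name waypoint-server and the datacenter dc1 are hypothetical placeholders, and consulServiceAddr is not a Waypoint function; port 9701 is the server port seen earlier in this thread.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// consulServiceAddr builds host:port for a service registered in Consul,
// using the DNS form <service>.service[.<datacenter>].<domain>.
// An empty domain falls back to Consul's default, "consul".
func consulServiceAddr(service, datacenter, domain, port string) string {
	if domain == "" {
		domain = "consul"
	}
	labels := []string{service, "service"}
	if datacenter != "" {
		labels = append(labels, datacenter)
	}
	labels = append(labels, domain)
	return net.JoinHostPort(strings.Join(labels, "."), port)
}

func main() {
	// Hypothetical service and datacenter names.
	fmt.Println(consulServiceAddr("waypoint-server", "dc1", "", "9701"))
	// → waypoint-server.service.dc1.consul:9701
}
```

Because the name resolves to whichever allocation currently backs the service, a context that stores it keeps working after Nomad reschedules the job. Clients still need a resolver that forwards the Consul domain to Consul.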
