Remove loopback when monitor fails #25

Open
philipcristiano opened this issue Oct 4, 2023 · 4 comments

@philipcristiano
Contributor

When a monitor succeeds, the loopback, NAT, and announcement are created. When a monitor fails, the announcement stops but the loopback addr remains until the monitor is removed. This causes unexpected behavior: a replacement Nomad job cannot reach resources on another host because the addr is still assigned on the local host.

For my setup I have multiple Traefik instances running with the same VIP. During a deploy the instances will be replaced but need to pull a new Docker image. Traefik is used as the LB to the Docker registry and the not-running Traefik instance cannot respond. The VIP is still assigned (but not announced) on the host and requests to the VIP fail (because Traefik isn't running).

Is this expected behavior? Should the loopback be removed when the monitor/consul-check fails?

@mayuresh82
Owner

Do you have an appropriate cleanup timer set for the app? When a monitor for an app fails, it starts the cleanup timer. Only when the cleanup timer expires are the app and its corresponding loopback removed. See

func (m *MonitorMgr) Cleanup(app string, exit chan bool) {
and
if err := deleteLoopback(a.app.Vip.Net); err != nil {

Are you also seeing any errors related to removing the loopback in the gocast logs?

@philipcristiano
Contributor Author

I definitely don’t have an appropriate cleanup timer set. I see the default is 15m and was assuming much faster. I’ll adjust that and give it a try!

philipcristiano added a commit to philipcristiano/nixos-cluster-config that referenced this issue Oct 20, 2023
To remove the vip addr from the host. This should reduce the time
same-host apps appear broken when there is a non-working addr on the
localhost

ref: mayuresh82/gocast#25 (comment)
@philipcristiano
Contributor Author

philipcristiano commented Oct 20, 2023

This appears to work with the lower timeout! Do you know of any practical minimum? I'm trying with 15s now but need to experiment more.

Would 1s or 5s cause any problems? Is there a reason why the addr shouldn't be removed together with the announcement instead of waiting for the cleanup timer? I was caught off guard that anything else on the same host would be ~broken until that cleanup timer completed, even though the route was gone.

@mayuresh82
Owner

The original idea was that transient issues can cause apps to appear/disappear in Consul, so this is just a prevention mechanism against flapping, i.e. adding and removing loopbacks too often. I do agree though that this can be removed, such that a default timer of 0 means remove immediately. I can give it a go, or feel free to send a PR for this!
