Remove loopback when monitor fails #25

philipcristiano · 2023-10-04T17:51:31Z

When a monitor succeeds, the loopback, NAT, and announcement are created. When a monitor fails, the announcement stops but the loopback addr remains until the monitor is removed. This causes unexpected behavior: a replacement Nomad job cannot reach resources on another host because the addr remains on the local host.

For my setup I have multiple Traefik instances running with the same VIP. During a deploy the instances will be replaced but need to pull a new Docker image. Traefik is used as the LB to the Docker registry and the not-running Traefik instance cannot respond. The VIP is still assigned (but not announced) on the host and requests to the VIP fail (because Traefik isn't running).

Is this expected behavior? Should the loopback be removed when the monitor/consul-check fails?

mayuresh82 · 2023-10-18T18:56:15Z

Do you have an appropriate cleanup timer set for the app? When a monitor for an app fails, it starts the cleanup timer. Only when the cleanup timer expires are the app and its corresponding loopback removed. See

gocast/controller/monitor.go

Line 343 in 6f26e86

func (m *MonitorMgr) Cleanup(app string, exit chan bool) {

and

gocast/controller/monitor.go

Line 200 in 6f26e86

if err := deleteLoopback(a.app.Vip.Net); err != nil {

Are you also seeing any errors related to removing the loopback in the gocast logs?

philipcristiano · 2023-10-18T19:12:28Z

I definitely don’t have an appropriate cleanup timer set. I see the default is 15m and was assuming much faster. I’ll adjust that and give it a try!

To remove the vip addr from the host. This should reduce the time same-host apps appear broken when there is a non-working addr on the localhost ref: mayuresh82/gocast#25 (comment)

philipcristiano · 2023-10-20T15:41:21Z

This appears to work with the lower timeout! Do you know of any practical minimum? I'm trying with 15s now but need to experiment more.

Would 1s or 5s cause any problems? Is there a reason why the addr shouldn't be removed along with the announcement instead of waiting for the cleanup timer? I was caught off guard that anything else on the same host would be ~broken until that cleanup timer completed, even though the route was gone.

mayuresh82 · 2023-10-26T01:40:21Z

The original idea was that transient issues can cause apps to appear and disappear in Consul, so this is just a protection mechanism against flapping, that is, adding and removing loopbacks too often. I do agree, though, that this can be changed so that a default timer of 0 means remove immediately. I can give it a go, or feel free to send a PR for this!
