Some services are being registered all the time #18538
I am seeing the same at the trace level, happening every 30 seconds:
Interestingly enough, this only seems to happen for 2 out of 4 services with sidecars on the host. What information can I provide to aid debugging? |
Thanks for the report @jorgemarey. I was able to reproduce the first issue (but not the second so far). Curiously, in my case, the diff was caused by the destination_type field. #18692 helps the situation a bit by allowing these two values to be empty, meaning Nomad will just accept whatever is set in Consul as correct.

@apollo13 unfortunately we don't have very descriptive logs on this diff logic. I'm assuming @jorgemarey patched the Nomad code to drill down into exactly which field was the problem 😅 One thing you could try is to compare the service definitions in your job with what's in the Consul agent as reported by its API.

In my case I noticed the service diff was:

```hcl
service {
  name = "nginx"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "whoami"
+         destination_type = "service"
          local_bind_port  = 8080
        }
      }
    }
  }
}
``` |
@lgfa29 Thank you.
EDIT: Funny story though: I looked through the nodes of a test cluster and only one job has this issue (and not the same job as last time). So I wonder whether it might have worked depending on how/when and against which Consul version the job got registered? Or maybe it depends on the moon… |
Yeah, we have a consul proxy-defaults config with
I guess that's what caused this.
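For reference, a Consul proxy-defaults config entry that sets a global mesh gateway mode can look like the following. This is a hypothetical sketch (the actual entry from this comment was not captured in the thread); the field names come from Consul's config entry format:

```hcl
# Hypothetical proxy-defaults entry. Setting a default mesh gateway
# mode here makes Consul report mode = "local" on every upstream,
# even when the Nomad job leaves the mode empty.
Kind = "proxy-defaults"
Name = "global"

MeshGateway {
  Mode = "local"
}
```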
Looking at that PR, I see a situation where I'm not sure what will happen:
Yep, I did patch the code to output where Nomad and Consul had the diff.

Regarding the second issue: the problem is that the check is only registered with Consul for a few milliseconds before it's removed again by the synchronization process. Once the synchronization starts, Nomad registers the service with both checks, because it sees a difference between the service it knows about and the one in Consul; later during that same process, Nomad sees a check that shouldn't be there and removes it, so the extra check only exists in Consul for a few milliseconds. I managed to notice it by making blocking queries to the Consul agent API, and in the Nomad logs I could constantly see:
I'll try to reproduce it myself and provide some information about how to do it if I manage to. |
Hum... yeah, I think you're right. To make things worse, I think the service will not be updated at all, because that difference method is also used to detect job updates during sync. I think we're lacking expressiveness here: an empty string can mean "whatever Consul says it should be", meaning we shouldn't sync, but what users usually intend for an empty value is for it to be set to Consul's default. So I think I will need to go back to the drawing board on this one 🤔

As a workaround for now, to avoid the unnecessary syncs you can set all the expected values directly in the job. More specifically, |
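A sketch of that workaround, using the upstream from earlier in this thread (the service name and port are just examples): spell out in the job every value Consul would otherwise fill in with a default, so the diff against the Consul agent comes back empty:

```hcl
# Workaround sketch: set the values Consul defaults explicitly,
# so Nomad's comparison with the Consul agent finds no difference.
upstreams {
  destination_name = "whoami"    # example upstream name
  destination_type = "service"   # Consul's default; set it explicitly
  local_bind_port  = 8080
}
```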
Not only destination_type and gateway.mode are affected. When an upstream is declared as an HTTP service (Protocol = "http" in its service-defaults), we must also add this to the job description (in my example, immich-ml is the upstream, an HTTP service, and immich is the consumer of this service):

```hcl
proxy {
  upstreams {
    destination_name = "immich-ml"
    destination_type = "service"
    local_bind_port  = 3003

    config {
      protocol = "http"
    }
  }
}
```

Without this, the Nomad agent also re-syncs the service in Consul every 30 seconds:

This is a pain to maintain :-( |
Nomad version
Nomad v1.5.6
Issue
We had a problem where we saw some Consul services flapping to critical for a moment (less than a second) and then becoming healthy again. While investigating this, we saw that Nomad is re-registering the services constantly.
In the client logs we can see:
We found that this happens with all services that have upstreams configured. The problem occurs here: when the mesh gateway mode is left empty in Nomad, Consul sets it to local, so when checking the difference between the value in the job and the value in Consul, Nomad finds a difference and tries to register the service again.
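A sketch of the mismatch in job-spec terms (the upstream name is hypothetical; the "local" value is what Consul reports back, as described above): an upstream with no mesh_gateway block is stored as empty in Nomad but comes back as local from Consul, so explicitly setting the mode makes the two sides match:

```hcl
# With no mesh_gateway block, Nomad stores mode = "" while Consul
# reports mode = "local", so every sync finds a diff. Setting the
# mode explicitly avoids the perpetual re-registration.
upstreams {
  destination_name = "some-upstream"  # hypothetical name
  local_bind_port  = 8080

  mesh_gateway {
    mode = "local"  # match what Consul reports as the default
  }
}
```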
Related to this, in that service we modified the interval of the check, and saw that in the process of re-registering the service, the old and new checks are both registered (for a moment, before the old check is removed again by the sync process).
Reproduction steps
For the second problem