
Register Consul Service and Checks atomically #3935

Closed
gahebe opened this issue Mar 5, 2018 · 17 comments · Fixed by #14944

Comments

@gahebe

gahebe commented Mar 5, 2018

Context

We use Nomad with Consul and Docker to deploy gRPC services to a cluster. We use Linkerd as the load balancer; it uses Consul to get the availability information of the services.

Scenario

We observe transient errors at the consumer when a new instance of the service is starting up during a service update with Nomad.
(A TCP health check is used to avoid routing requests to the instance while the Docker container has started but the service inside is not yet ready to receive requests. Linkerd is configured to only consider services with a passing health check status.)

Expected behavior

Being able to update the services seamlessly with the above setup.

Versions

Nomad v0.6.3
Consul v1.0.3
Linkerd v1.3.5

Analysis

Linkerd uses blocking queries to get availability info from Consul.
Example: /v1/health/service/testservice?index=1366&dc=dc1&passing=true
I captured the network traffic, and this type of query first returns the new service instance without the TCP health check defined, then shortly afterwards returns it with the TCP health check.
So it seems Consul considers a service “passing” when it has no health checks defined,
and Nomad first registers the new service instance and then registers its health checks in a separate call.
The captured network packets (Nomad -> Consul) confirm that this happens in two separate requests:

PUT /v1/agent/service/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 165
Accept-Encoding: gzip
{"ID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Name":"testservice","Port":27123,"Address":"10.0.75.1","Check":null,"Checks":null}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

PUT /v1/agent/check/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 233
Accept-Encoding: gzip
{"ID":"942d6c5b20bea9520f70ced61336a2987bf9c530","Name":"service: \"testservice\" check",
"ServiceID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Interval":"10s","Timeout":"2s","TCP":"10.0.75.1:27123"}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

I believe Nomad should register the service and its health checks in a single Consul call;
otherwise the new service instance is considered healthy before Nomad registers its health checks.
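
For illustration, a combined registration via the Consul Go API could look something like the sketch below. This is a minimal, hypothetical example using the values from the capture above, not Nomad's actual code; the check is started as critical until it first passes.

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register the service and its TCP check in a single
    // PUT /v1/agent/service/register call, instead of a follow-up
    // PUT /v1/agent/check/register.
    reg := &api.AgentServiceRegistration{
        ID:      "_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
        Name:    "testservice",
        Port:    27123,
        Address: "10.0.75.1",
        Checks: api.AgentServiceChecks{
            {
                Name:     `service: "testservice" check`,
                TCP:      "10.0.75.1:27123",
                Interval: "10s",
                Timeout:  "2s",
                Status:   api.HealthCritical, // stay critical until the first check passes
            },
        },
    }
    if err := client.Agent().ServiceRegister(reg); err != nil {
        log.Fatal(err)
    }
}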

@jippi
Contributor

jippi commented Mar 5, 2018

we use linkerd with nomad too, it works fine

  1. https://www.nomadproject.io/docs/job-specification/service.html#initial_status set initial_status to critical; the service shouldn't be considered healthy until your app says so
  2. put something like 5-10s in your task shutdown_delay https://www.nomadproject.io/docs/job-specification/task.html#shutdown_delay - so Linkerd forgets about the alloc before the kill signal is even sent to the process
  3. make sure your task kill_timeout ( https://www.nomadproject.io/docs/job-specification/task.html#kill_timeout ) is sufficiently long for your task to drain on-going requests
  4. make sure your client config has a high enough max kill timeout https://www.nomadproject.io/docs/agent/configuration/client.html#max_kill_timeout
  5. success

@gahebe
Author

gahebe commented Mar 5, 2018

@jippi Thanks for your help. In our case the shutdown part is not the problem; we confirmed it works fine (we use shutdown_delay, ...). The problem is that the new instance is considered healthy before its health check gets registered. We can almost always reproduce that under heavy load.
I believe that setting initial_status to critical will not change this for two reasons:

  1. This is a property of the health check itself, and our problem is that Nomad registers the health check separately, a bit later.
  2. If you check the second PUT request (in the first comment) that Nomad sends to Consul, you can see that we do not specify initial_status in the job specification, and in that case it defaults to critical on the Consul side.

@dadgar dadgar changed the title Seamless update of applications is not possible when relying on health check Register Consul Service and Checks atomically Mar 5, 2018
@preetapan
Member

Thanks for reporting this @gahebe. We are aware of this and are tracking a fix for a future release.

@madsholden

Did you ever find a work-around for this issue @gahebe? We have the same problem, and can't afford the downtime on every deployment.

@madsholden

Has this issue been roadmapped @tgross?

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul. We've tried using Traefik health checks in addition to the ones in Consul, but unfortunately new services are marked as healthy before the first health check is done, leaving us with the same problem.

Since we're trying to not lose a single request during deployments, we're unfortunately considering migrating away from Nomad.

@tgross
Member

tgross commented Mar 18, 2021

Has this issue been roadmapped @tgross?

It's been moved from the community issues "needs roadmapping" board to our roadmap, which is unfortunately not yet public. It has not been targeted for the upcoming Nomad 1.1, and Nomad 1.2 hasn't been scoped yet. We'd certainly welcome a pull request if this is a show-stopper for you, although it's admittedly a squirrelly area of the code base.

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul.

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

@greut
Contributor

greut commented Mar 25, 2021

How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, and then perform the necessary cleanup? I've got the impression that this call is fairly recent (Consul v1.7) and thus would introduce some breaking changes in Nomad itself.
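
For reference, a minimal sketch of that call with the Consul Go API (illustrative service ID and check values, not Nomad code):

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    reg := &api.AgentServiceRegistration{
        ID:      "testservice-1", // illustrative ID
        Name:    "testservice",
        Port:    27123,
        Address: "10.0.75.1",
        Checks: api.AgentServiceChecks{
            {TCP: "10.0.75.1:27123", Interval: "10s", Timeout: "2s"},
        },
    }

    // ServiceRegisterOpts (Consul >= 1.7) registers the service together with
    // its checks; ReplaceExistingChecks drops checks from a previous
    // registration that are no longer listed, which covers part of the cleanup.
    opts := api.ServiceRegisterOpts{ReplaceExistingChecks: true}
    if err := client.Agent().ServiceRegisterOpts(reg, opts); err != nil {
        log.Fatal(err)
    }
}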

Cheers,

@apollo13
Contributor

Call me stupid, but I tried this locally:

diff --git a/command/agent/consul/service_client.go b/command/agent/consul/service_client.go
index 66c00225c..b8340b062 100644
--- a/command/agent/consul/service_client.go
+++ b/command/agent/consul/service_client.go
@@ -941,6 +941,7 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
                Meta:              meta,
                Connect:           connect, // will be nil if no Connect stanza
                Proxy:             gateway, // will be nil if no Connect Gateway stanza
+               Checks:            make([]*api.AgentServiceCheck, 0, len(service.Checks)),
        }
        ops.regServices = append(ops.regServices, serviceReg)
 
@@ -952,6 +953,23 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
        for _, registration := range checkRegs {
                sreg.checkIDs[registration.ID] = struct{}{}
                ops.regChecks = append(ops.regChecks, registration)
+               serviceReg.Checks = append(serviceReg.Checks, &api.AgentServiceCheck{
+                       CheckID:                registration.ID,
+                       Name:                   registration.Name,
+                       Status:                 registration.Status,
+                       Timeout:                registration.Timeout,
+                       Interval:               registration.Interval,
+                       SuccessBeforePassing:   registration.SuccessBeforePassing,
+                       FailuresBeforeCritical: registration.FailuresBeforeCritical,
+                       TLSSkipVerify:          registration.TLSSkipVerify,
+                       HTTP:                   registration.HTTP,
+                       Method:                 registration.Method,
+                       Header:                 registration.Header,
+                       TCP:                    registration.TCP,
+                       TTL:                    registration.TTL,
+                       GRPC:                   registration.GRPC,
+                       GRPCUseTLS:             registration.GRPCUseTLS,
+               })
        }
 
        return sreg, nil

and it immediately registered the service with the checks. This seems to work (at least as verified by wireshark and the nomad source) because ServiceClient.sync handles services first, then fetches the checks and only then creates/removes checks. Because the registration IDs match, the check is just created once. @greut do you think you could try that? There is also no need to remove any other code, because the other code is still needed if a check changes etc…
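
One way to check for the original race without wireshark is to watch the same passing-only query Linkerd issues and flag any instance that shows up before its service-level check exists. A minimal sketch, assuming a service named testservice (not part of the patch above):

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    var lastIndex uint64
    for {
        // Same style of blocking query Linkerd uses:
        // /v1/health/service/testservice?passing=true
        entries, meta, err := client.Health().Service("testservice", "", true,
            &api.QueryOptions{WaitIndex: lastIndex})
        if err != nil {
            log.Fatal(err)
        }
        lastIndex = meta.LastIndex

        for _, e := range entries {
            hasServiceCheck := false
            for _, c := range e.Checks {
                if c.ServiceID == e.Service.ID {
                    hasServiceCheck = true
                }
            }
            if !hasServiceCheck {
                // Without the patch, a freshly registered instance can appear
                // here before its check exists; with the patch it should not.
                log.Printf("instance %s returned as passing with no check registered", e.Service.ID)
            }
        }
    }
}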

@tgross
Member

tgross commented Mar 26, 2021

How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, then perform the necessary cleanup.

I'm pretty sure Consul supports putting the check in the service registration back to at least 1.0, because I have a project from an old job that does that and just checked the API version there. So I don't think it's a matter of the API not supporting it.

I had an internal discussion with some folks and it looks like a lot of the reasoning for how this was designed was about optimizing writes to Consul's raft DB, but it turns out that Consul already takes care of all of that performance worry under the hood. So now it's just a matter of getting it done. I think the approach @apollo13 has here is promising, but I'd want to make sure that it still works on updates, not just the initial registration.

@apollo13
Contributor

apollo13 commented Mar 26, 2021 via email

@tgross
Member

tgross commented Apr 30, 2021

See also #10482

@robloxrob

We seem to have this issue as well.

@madsholden

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

Unfortunately neither Traefik nor HAProxy (which we also tried) is able to work around this issue directly. They can both be configured with their own health checks, but newly discovered services are considered healthy by default, which makes it difficult.

The workaround we're using now is having a custom http proxy between Traefik and Consul. Using the http provider in Traefik, and having the proxy only forward services that have a registered health check, works. But it is far from ideal.
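
A minimal sketch of that filtering idea, assuming a single hard-coded service name and returning the raw Consul health entries rather than an actual Traefik http-provider payload:

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Only forward instances of "testservice" that already have at least one
    // check registered for the service itself, so brand-new registrations
    // without checks stay invisible to the load balancer.
    http.HandleFunc("/services/testservice", func(w http.ResponseWriter, r *http.Request) {
        entries, _, err := client.Health().Service("testservice", "", true, nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        filtered := entries[:0]
        for _, e := range entries {
            for _, c := range e.Checks {
                if c.ServiceID == e.Service.ID {
                    filtered = append(filtered, e)
                    break
                }
            }
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(filtered)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}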

@apollo13
Contributor

apollo13 commented Jun 9, 2021

@madsholden Could you test my patch? The more people test it, the sooner it will probably get merged.

@rlandingham

We are running into a similar issue.

Our Setup:
Waypoint
Nomad
Consul
Traefik

We are doing canary deployments, and no matter what we set shutdown_delay to, it seems that when the old task is being deregistered we see blips where we get 502s going to the application. Originally, we thought shutdown_delay would resolve the issue, but it almost seems like there is a split second (or a few seconds) where there is no config routing the traffic to the proper app.

@apollo13
Contributor

apollo13 commented Aug 2, 2022

@rlandingham could you test my patch (see a few comments above)? If it fixes it for you a PR would certainly be the best step forward.

@shoenig shoenig self-assigned this Oct 17, 2022
@shoenig shoenig added this to the 1.5.0 milestone Oct 17, 2022
@shoenig shoenig modified the milestones: 1.5.0, 1.4.2 Oct 18, 2022
shoenig added a commit that referenced this issue Oct 18, 2022
This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.

Fixes #3935
shoenig added a commit that referenced this issue Oct 19, 2022
…14944)

* consul: register checks along with service on initial registration

This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.

Fixes #3935

* cr: followup to fix cause of extra consul logging

* cr: fix another bug

* cr: fixup changelog
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2023