
Register Consul Service and Checks atomically #3935

Closed
gahebe opened this issue Mar 5, 2018 · 17 comments · Fixed by #14944

Comments

@gahebe

gahebe commented Mar 5, 2018

Context

We use Nomad with Consul and Docker to deploy gRPC services to a cluster. We use Linkerd as the load balancer; it uses Consul to get the availability information of the services.

Scenario

We observe transient errors at the consumer when a new instance of the service is starting up during a service update with Nomad.
(A TCP health check is used to avoid routing requests to the instance while the Docker container has started but the service inside is not yet ready to receive requests. Linkerd is configured to only consider services with a passing health check status.)

Expected behavior

Being able to update the services seamlessly with the above setup.

Versions

Nomad v0.6.3
Consul v1.0.3
Linkerd v1.3.5

Analysis

Linkerd uses blocking queries to get availability info from Consul.
Example: /v1/health/service/testservice?index=1366&dc=dc1&passing=true
I captured the network traffic, and this type of query first returns the new service instance without the TCP health check defined, then shortly afterwards returns it with the TCP health check.
So it seems Consul considers a service “passing” when it has no health checks defined,
and Nomad first registers the new service instance and then registers its health checks in a separate call.
The captured network packets (Nomad -> Consul) confirm that this happens in two separate requests:

PUT /v1/agent/service/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 165
Accept-Encoding: gzip
{"ID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Name":"testservice","Port":27123,"Address":"10.0.75.1","Check":null,"Checks":null}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

PUT /v1/agent/check/register HTTP/1.1
Host: 127.0.0.1:8500
User-Agent: Go-http-client/1.1
Content-Length: 233
Accept-Encoding: gzip
{"ID":"942d6c5b20bea9520f70ced61336a2987bf9c530","Name":"service: \"testservice\" check",
"ServiceID":"_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
"Interval":"10s","Timeout":"2s","TCP":"10.0.75.1:27123"}

HTTP/1.1 200 OK
Date: Fri, 02 Mar 2018 14:33:31 GMT
Content-Length: 0
Content-Type: text/plain; charset=utf-8

I believe Nomad should register the service and its health checks in a single Consul call;
otherwise the new service instance is considered healthy before Nomad registers its health checks.
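
For illustration, a combined registration via the Consul Go API could look something like the sketch below. This is a minimal, hypothetical example using the values from the capture above, not Nomad's actual code; the check is started as critical until it first passes.

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register the service and its TCP check in a single
    // PUT /v1/agent/service/register call, instead of a follow-up
    // PUT /v1/agent/check/register.
    reg := &api.AgentServiceRegistration{
        ID:      "_nomad-executor-47313988-2a65-66e0-46af-491023330cca-session-testservice",
        Name:    "testservice",
        Port:    27123,
        Address: "10.0.75.1",
        Checks: api.AgentServiceChecks{
            {
                Name:     `service: "testservice" check`,
                TCP:      "10.0.75.1:27123",
                Interval: "10s",
                Timeout:  "2s",
                Status:   api.HealthCritical, // stay critical until the first check passes
            },
        },
    }
    if err := client.Agent().ServiceRegister(reg); err != nil {
        log.Fatal(err)
    }
}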

@jippi
Contributor

jippi commented Mar 5, 2018

we use linkerd with nomad too, it works fine

  1. https://www.nomadproject.io/docs/job-specification/service.html#initial_status set initial_status to critical; the service shouldn't be considered healthy until your app says so
  2. put something like 5-10s in your task shutdown_delay https://www.nomadproject.io/docs/job-specification/task.html#shutdown_delay - so Linkerd forgets about the alloc before the kill signal is even sent to the process
  3. make sure your task kill_timeout ( https://www.nomadproject.io/docs/job-specification/task.html#kill_timeout ) is sufficiently long for your task to drain on-going requests
  4. make sure your client config has a high enough max kill timeout https://www.nomadproject.io/docs/agent/configuration/client.html#max_kill_timeout
  5. success

@gahebe
Author

gahebe commented Mar 5, 2018

@jippi Thanks for your help. In our case the shutdown part is not the problem; we confirmed it works fine (we use shutdown_delay, ...). The problem is that the new instance is considered healthy before its health check gets registered. We can almost always reproduce that under heavy load.
I believe that setting initial_status to critical will not change this for two reasons:

  1. This is a property of the health check itself, and our problem is that Nomad registers the health check separately, a bit later.
  2. If you check the second PUT request (in the first comment) that Nomad sends to Consul, you can see that we do not specify initial_status in the job specification, and in that case it defaults to critical on the Consul side.

@dadgar dadgar changed the title Seamless update of applications is not possible when relying on health check Register Consul Service and Checks atomically Mar 5, 2018
@preetapan
Member

Thanks for reporting this @gahebe. We are aware of this and are tracking a fix for a future release.

@madsholden

Did you ever find a work-around for this issue @gahebe? We have the same problem, and can't afford the downtime on every deployment.

@madsholden

Has this issue been roadmapped @tgross?

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul. We've tried using Traefik health checks in addition to the ones in Consul, but unfortunately new services are marked as healthy before the first health check is done, leaving us with the same problem.

Since we're trying to not lose a single request during deployments, we're unfortunately considering migrating away from Nomad.

@tgross
Member

tgross commented Mar 18, 2021

Has this issue been roadmapped @tgross?

It's been moved from the community issues "needs roadmapping" board to our roadmap, which is unfortunately not yet public. It has not been targeted for the upcoming Nomad 1.1, and Nomad 1.2 hasn't been scoped yet. We'd certainly welcome a pull request if this is a show-stopper for you, although it's admittedly a squirrelly area of the code base.

I haven't found a workaround for this issue. I'm really wondering how others are dealing with this. We're using Traefik as a dynamic load balancer, and it often manages to route a request or two to the newly started services before they are marked unhealthy by Consul.

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

@greut
Contributor

greut commented Mar 25, 2021

How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, and then perform the necessary cleanup? I've got the impression that this call is fairly recent (Consul v1.7) and thus would introduce some breaking changes in Nomad itself.
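
For reference, a minimal sketch of that call with the Consul Go API (illustrative service ID and check values, not Nomad code):

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    reg := &api.AgentServiceRegistration{
        ID:      "testservice-1", // illustrative ID
        Name:    "testservice",
        Port:    27123,
        Address: "10.0.75.1",
        Checks: api.AgentServiceChecks{
            {TCP: "10.0.75.1:27123", Interval: "10s", Timeout: "2s"},
        },
    }

    // ServiceRegisterOpts (Consul >= 1.7) registers the service together with
    // its checks; ReplaceExistingChecks drops checks from a previous
    // registration that are no longer listed, which covers part of the cleanup.
    opts := api.ServiceRegisterOpts{ReplaceExistingChecks: true}
    if err := client.Agent().ServiceRegisterOpts(reg, opts); err != nil {
        log.Fatal(err)
    }
}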

Cheers,

@apollo13
Contributor

Call me stupid, but I tried this locally:

diff --git a/command/agent/consul/service_client.go b/command/agent/consul/service_client.go
index 66c00225c..b8340b062 100644
--- a/command/agent/consul/service_client.go
+++ b/command/agent/consul/service_client.go
@@ -941,6 +941,7 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
                Meta:              meta,
                Connect:           connect, // will be nil if no Connect stanza
                Proxy:             gateway, // will be nil if no Connect Gateway stanza
+               Checks:            make([]*api.AgentServiceCheck, 0, len(service.Checks)),
        }
        ops.regServices = append(ops.regServices, serviceReg)
 
@@ -952,6 +953,23 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, w
        for _, registration := range checkRegs {
                sreg.checkIDs[registration.ID] = struct{}{}
                ops.regChecks = append(ops.regChecks, registration)
+               serviceReg.Checks = append(serviceReg.Checks, &api.AgentServiceCheck{
+                       CheckID:                registration.ID,
+                       Name:                   registration.Name,
+                       Status:                 registration.Status,
+                       Timeout:                registration.Timeout,
+                       Interval:               registration.Interval,
+                       SuccessBeforePassing:   registration.SuccessBeforePassing,
+                       FailuresBeforeCritical: registration.FailuresBeforeCritical,
+                       TLSSkipVerify:          registration.TLSSkipVerify,
+                       HTTP:                   registration.HTTP,
+                       Method:                 registration.Method,
+                       Header:                 registration.Header,
+                       TCP:                    registration.TCP,
+                       TTL:                    registration.TTL,
+                       GRPC:                   registration.GRPC,
+                       GRPCUseTLS:             registration.GRPCUseTLS,
+               })
        }
 
        return sreg, nil

and it immediately registered the service with the checks. This seems to work (at least as verified by wireshark and the nomad source) because ServiceClient.sync handles services first, then fetches the checks and only then creates/removes checks. Because the registration IDs match, the check is just created once. @greut do you think you could try that? There is also no need to remove any other code, because the other code is still needed if a check changes etc…
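
One way to check for the original race without wireshark is to watch the same passing-only query Linkerd issues and flag any instance that shows up before its service-level check exists. A minimal sketch, assuming a service named testservice (not part of the patch above):

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    var lastIndex uint64
    for {
        // Same style of blocking query Linkerd uses:
        // /v1/health/service/testservice?passing=true
        entries, meta, err := client.Health().Service("testservice", "", true,
            &api.QueryOptions{WaitIndex: lastIndex})
        if err != nil {
            log.Fatal(err)
        }
        lastIndex = meta.LastIndex

        for _, e := range entries {
            hasServiceCheck := false
            for _, c := range e.Checks {
                if c.ServiceID == e.Service.ID {
                    hasServiceCheck = true
                }
            }
            if !hasServiceCheck {
                // Without the patch, a freshly registered instance can appear
                // here before its check exists; with the patch it should not.
                log.Printf("instance %s returned as passing with no check registered", e.Service.ID)
            }
        }
    }
}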

@tgross
Member

tgross commented Mar 26, 2021

How hard/breaking would it be to go with consul/api.Agent.ServiceRegisterOpts in the first call, meaning it would already contain the checks to be created, then perform the necessary cleanup.

I'm pretty sure Consul supports putting the check in the service registration back to at least 1.0, because I have a project from an old job that does that and just checked the API version there. So I don't think it's a matter of the API not supporting it.

I had an internal discussion with some folks and it looks like a lot of the reasoning for how this was designed was about optimizing writes to Consul's raft DB, but it turns out that Consul already takes care of all of that performance worry under the hood. So now it's just a matter of getting it done. I think the approach @apollo13 has here is promising, but I'd want to make sure that it still works on updates, not just the initial registration.

@apollo13
Contributor

apollo13 commented Mar 26, 2021 via email

@tgross
Member

tgross commented Apr 30, 2021

See also #10482

@robloxrob

We seem to have this issue as well.

@madsholden

I'm not an expert in traefik, but doesn't it have a polling configuration value for Consul? The race here is fairly short; you might be able to reduce the risk of it (but not eliminate it entirely) by increasing that polling window. I recognize this is a workaround rather than a fix. In any case, I'd expect a production load balancer to be able to handle a one-off dropped request if only because of CAP; there might be some configuration values in traefik you need to tweak here.

Unfortunately neither Traefik nor HAProxy (which we also tried) is able to work around this issue directly. They can both be configured with their own health checks, but newly discovered services are considered healthy by default, which makes it difficult.

The workaround we're using now is having a custom http proxy between Traefik and Consul. Using the http provider in Traefik, and having the proxy only forward services that have a registered health check, works. But it is far from ideal.
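
A minimal sketch of that filtering idea, assuming a single hard-coded service name and returning the raw Consul health entries rather than an actual Traefik http-provider payload:

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Only forward instances of "testservice" that already have at least one
    // check registered for the service itself, so brand-new registrations
    // without checks stay invisible to the load balancer.
    http.HandleFunc("/services/testservice", func(w http.ResponseWriter, r *http.Request) {
        entries, _, err := client.Health().Service("testservice", "", true, nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        filtered := entries[:0]
        for _, e := range entries {
            for _, c := range e.Checks {
                if c.ServiceID == e.Service.ID {
                    filtered = append(filtered, e)
                    break
                }
            }
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(filtered)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}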

@apollo13
Contributor

apollo13 commented Jun 9, 2021

@madsholden Could you test my patch? The more people test it, the sooner it will probably get merged.

@rlandingham

We are running into a similar issue.

Our Setup:
Waypoint
Nomad
Consul
Traefik

We are doing canary deployments, and no matter what we set shutdown_delay to, it seems that when the old task is being deregistered we see blips where we get 502s going to the application. Originally, we thought shutdown_delay would resolve the issue, but it almost seems like there is a split second (or a few seconds) where there is no config routing the traffic to the proper app.

@apollo13
Contributor

apollo13 commented Aug 2, 2022

@rlandingham could you test my patch (see a few comments above)? If it fixes it for you a PR would certainly be the best step forward.

@shoenig shoenig self-assigned this Oct 17, 2022
@shoenig shoenig added this to the 1.5.0 milestone Oct 17, 2022
@shoenig shoenig modified the milestones: 1.5.0, 1.4.2 Oct 18, 2022
shoenig added a commit that referenced this issue Oct 18, 2022
This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.

Fixes #3935
shoenig added a commit that referenced this issue Oct 19, 2022
…14944)

* consul: register checks along with service on initial registration

This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.

Fixes #3935

* cr: followup to fix cause of extra consul logging

* cr: fix another bug

* cr: fixup changelog
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2023