Promoting a canary causes its service check to immediately become critical #4566

Closed
byronwolfman opened this issue Aug 8, 2018 · 25 comments · Fixed by #5536

@byronwolfman (Contributor) commented Aug 8, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)
Consul v1.2.0

Operating system and Environment details

CentOS Linux release 7.5.1804 (Core) (EC2)

Issue

Our engineering teams have noticed that their web apps sometimes brown out whenever we promote canaries. Many of these apps are very small, and we only run two containers. Therefore in our update stanza we set canary = 2 to do a blue-green update, wait for the canaries to become healthy, and then promote them. We use consul and consul-template to add services to our load balancers so that as new services become healthy, they're entered into the catalog and consequently into the load balancer config.

What we've noticed is that at the moment of promotion, suddenly the service seems to disappear from the load balancer. The 2 old containers are removed from the consul catalog because we've just promoted the job, and that's expected. But the new 2 containers also seem to disappear, even though they were healthy a moment ago. They'll then re-appear a few seconds later, but not before we drop a bunch of traffic on the floor.

As far as I can tell, this is because when a nomad task is promoted, its service definition in consul is replaced. This is a problem since all consul services start out critical by default.
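For context, the update stanza we use for these blue/green deploys is shaped roughly like this (a minimal sketch with illustrative values, not our exact production config):

job "web-app" {

  update {
    max_parallel     = 1
    canary           = 2     # equal to the group count, so the whole group is replaced as canaries
    min_healthy_time = "30s"
    healthy_deadline = "5m"
    health_check     = "checks"
    auto_revert      = true
  }

  group "web-app" {
    count = 2
    # task definitions omitted
  }
}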

Reproduction steps

You can watch this in real-time if nomad and consul are both running. Deploy a job to nomad which will register with consul, and then deploy an update with a canary. You can look at service definitions in Consul with:

$ curl -Ss 127.0.0.1:8500/v1/health/checks/training-wheels | egrep 'ServiceID|Status'
        "Status": "passing",
        "ServiceID": "_nomad-task-kigfngvz6escims3tgkv6wz774op3w5a",   << old container
        "Status": "passing",
        "ServiceID": "_nomad-task-zas7siokij4xhdj5bigqxhzuynixm2kf",   << new container

So far so good. As soon as the job is promoted, however:

$ curl -Ss 127.0.0.1:8500/v1/health/checks/training-wheels | egrep 'ServiceID|Status'
        "ServiceID": "_nomad-task-2rmpcuvevadkpuvvduwu6my33yxqoo3v",   << new container; also a new service definition
        "Status": "critical",

This is a new service definition in Consul, and therefore its healthcheck starts out critical. We have a couple of ways to work around this, but neither is appealing:

  1. Deploy a single canary instead of blue-greening the whole thing. Not really desirable since this doubles deployment time. This isn't that great of a workaround for a service with only two allocations anyway, since it will roll over the second allocation the moment the first one becomes (temporarily) healthy.
  2. Set initial_status = "passing" in the service definition. Potentially worse, since this means that unhealthy services will immediately start accepting traffic ("unhealthy" meaning an unready service, but also a completely broken service).

In other words we're in a pretty bad place right now.

Job file (if appropriate)

A playground app of ours:

job "training-wheels" {

  datacenters = ["vagrant"]
  type = "service"

  update {
    max_parallel = 1
    min_healthy_time = "5s"
    healthy_deadline = "5m"
    health_check = "checks"
    auto_revert = true
    canary = 1
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "5s"
    healthy_deadline = "5m"
  }

  group "training-wheels" {
    count = 1
    restart {
      attempts = 10
      interval = "5m"
      delay = "10s"
      mode = "delay"
    }

    task "training-wheels" {
      driver = "docker"
      config {
        image = "redacted-repo/training-wheels:0.1.5"
        command = "/apps/shinatra.sh"
        hostname = "training-wheels"
        dns_servers = ["172.17.0.1"]
        port_map {
          http = 8080
        }
      }

      shutdown_delay = "45s"

      resources {
        cpu    = 250
        memory = 128
        network {
          mbits = 100
          port "http" {}
        }
      }

      service {
        name = "training-wheels"
        port = "http"
        tags = [
          "0.1.5",
          "backend",
          "http",
          "microservice"
        ]
        check {
          name     = "healthcheck"
          type     = "http"
          port     = "http"
          path     = "/api/healthcheck"
          interval = "15s"
          timeout  = "10s"

          check_restart {
            limit = 3
            grace = "1m"
          }
        }
      }
    }
  }
}
@byronwolfman (Contributor Author)

For fun, I fired up an old test environment that's running an older version (Nomad v0.7.1 (0b295d3)). This problem does not occur on 0.7.1 since consul service definitions are not replaced.

Using the same workflow as above, I can check during deploy:

        "Status": "passing",
        "ServiceID": "_nomad-task-sinvvhyefh5ky7hssxywbvp7lqedkpkv",   << new container
        "Status": "passing",
        "ServiceID": "_nomad-task-qz6w35hqdsklk4x62iozivgzr6fkpqiw",   << old container

And then after promoting:

        "Status": "passing",
        "ServiceID": "_nomad-task-sinvvhyefh5ky7hssxywbvp7lqedkpkv",   << new container, same ID

@shantanugadgil (Contributor) commented Aug 9, 2018

Wow, I too observed this behaviour when using a test setup of Docker containers and blue/green promotion.

I had written it off as something I had done wrong among the myriad of options in the check_restart and restart stanzas, but I guess others are facing a similar issue.

I haven't tried a previous version to see if blue/green works there.

BTW, the usual rolling upgrade settings work, i.e. the service stays online when I upgrade only a few allocations at a time.

At the time of my observation, I was using Nomad 0.8.4 and Consul 1.2.1.

@byronwolfman (Contributor Author)

I've tested this on three versions of nomad and two versions of consul now. The behaviour seems to be consul version-agnostic, but FWIW I've performed my testing with consul 1.0.2 and 1.2.0. As for nomad:

Nomad v0.7.1 (0b295d3) - not affected
Nomad v0.8.3 (c85483d) - not affected
Nomad v0.8.4 (dbee1d7) - affected

My best guess is that this new behaviour was introduced via #4259, specifically these lines:

https://github.com/hashicorp/nomad/pull/4259/files#diff-8612d5e61f84976db572b3f38eb380c0R1056
https://github.com/hashicorp/nomad/pull/4259/files#diff-396d9428285e4245041a3d0b15546d5dR3871

The consul service ID is a hash of the allocation ID and task name (to avoid collisions), and as of that PR, of the canary bool as well. Since a canary stops being a canary when the job is promoted, the hash changes and a new consul service registration is essentially forced. The new service is consequently critical by default.

I'm not familiar enough with Nomad's codebase to know if including the canary status in the hash is necessary or not, but if it isn't, then excluding it from the hashing function is probably the easiest/best solution. Perhaps it's not as straightforward though (plus, the canary tags are also hashed in this function, which will cause this behaviour for anyone who uses them).

@shantanugadgil (Contributor)

@byronwolfman just a thought: could the following help (to mark the status as passing by default) and prevent the bounce? 💡
https://www.nomadproject.io/docs/job-specification/service.html#initial_status

@ramSeraph

Saw this too, with Nomad 0.8.4 and Consul 1.2.0. It happened when I was trying out canary tags. I thought it was a misconfiguration on my part and didn't get a chance to really look into it.

@byronwolfman (Contributor Author)

> @byronwolfman just a thought: could the following help (to mark the status as passing by default) and prevent the bounce? 💡
> https://www.nomadproject.io/docs/job-specification/service.html#initial_status

Yep, but I've described why this is almost worse than the service check flapping to critical. By setting initial_status = "passing", the task will immediately enter into the catalog and start receiving traffic whether it's ready or not ("not" meaning that it is still starting up, or that it is panic-looping).

We don't want to advertise the task until it is ready to receive traffic, which is why it makes sense for consul to start all registered services as "critical" until their healthchecks are passing. Unfortunately this has unintended consequences for the canary.

Having thought about this, what we will probably do is set initial_status = "passing" for tasks, but drastically increase our minimum wait time in consul-template so that it converges on changes more slowly. A newly-started task will start as "passing" and then become "critical" a few seconds later if it is not done starting up (or if it's failing), and if consul-template converges slowly enough, we'll never send it traffic even though it was technically in the catalog for a short period of time. We consider this a workaround at best and it will probably come with its own unintended side-effects, so I'm hopeful the root issue is solvable.
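For anyone weighing the same workaround, it would look roughly like this (a sketch with illustrative timings; the check change goes in the Nomad service stanza, the wait block in the consul-template config):

# Nomad job, inside the service stanza: register the check as passing so the
# task enters the catalog immediately (with the trade-off described above).
check {
  name           = "healthcheck"
  type           = "http"
  port           = "http"
  path           = "/api/healthcheck"
  interval       = "15s"
  timeout        = "10s"
  initial_status = "passing"
}

# consul-template config: converge slowly, so a task that flips back to
# critical a few seconds after registering never makes it into the load
# balancer config.
wait {
  min = "15s"
  max = "30s"
}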

We're also considering rolling back to 0.8.3, but 0.8.4 is so stable otherwise that I'm not too stoked about the idea of a rollback.

@shantanugadgil (Contributor)

@byronwolfman apologies, I didn't see that you had already mentioned setting initial_status to passing in your original post.

I too had thought about the side effects of marking something healthy when it is not.
Thankfully, the apps where I use it have retry mechanisms which let this behavior slide. 😉

Regards,
Shantanu

@byronwolfman (Contributor Author)

Hi all, any traction on this?

@lpicanco commented Sep 5, 2018

An alternative would be to update one Consul service at a time, so that they aren't all left in a critical state at the same time. What do you think?

@byronwolfman (Contributor Author)

> An alternative would be to update one Consul service at a time, so that they aren't all left in a critical state at the same time. What do you think?

I appreciate this and the other well-meaning workarounds, but this is not our desired workflow, and it doubles the time it takes to deploy an app. I opened this issue because the change breaks a documented workflow for blue/green deployments here:

https://www.nomadproject.io/guides/operating-a-job/update-strategies/blue-green-and-canary-deployments.html

The breaking change here isn't to some undocumented edge case that happened to work as a side-effect of how nomad and consul fit together; it's a change that invalidates the entire document above.

@mildred (Contributor) commented Sep 17, 2018

> (plus, the canary tags are also hashed in this function, which will cause this behaviour for anyone who uses them).

And that's a problem. How can you use blue/green and direct traffic to only the canaries or only the promoted allocs if you have no way to differentiate between the two?
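For reference, the canary tags feature under discussion is configured in the service stanza roughly like this (a sketch; tag values are illustrative):

service {
  name = "training-wheels"
  port = "http"

  # Tags applied to promoted (non-canary) allocations
  tags = ["backend", "http"]

  # Tags applied while the allocation is still a canary, so a load balancer
  # or router can treat canaries differently
  canary_tags = ["backend", "http", "canary"]
}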

> Yep, but I've described why this is almost worse than the service check flapping to critical. By setting initial_status = "passing", the task will immediately enter into the catalog and start receiving traffic whether it's ready or not ("not" meaning that it is still starting up, or that it is panic-looping).

I don't see how it could be a problem if that's done by Nomad on Consul service replacement. initial_status = "passing" would need to be set only for services updated after a canary deployment that are already known to be passing.

@mildred (Contributor) commented Sep 17, 2018

@schmichael and @preetapan, you are the ones who improved the deployment logic in the commits indicated by #4566 (comment) (more specifically 17c6eb8#diff-8612d5e61f84976db572b3f38eb380c0L1043). Do you know why the Consul service ID is a hash of the canary flag (and the canary tags)? Why was it done that way? Could we consider removing the canary flag and tags from the hash to fix this issue?

@preetapan (Member)

@mildred I addressed this in a comment on the mailing list https://groups.google.com/forum/#!topic/nomad-tool/GatYXal9evg

TLDR - Using the service ID's hash to encapsulate all of its internal state (like canary status/tag values) was, in retrospect, not the right design decision and has been a source of brittleness. We are working on redesigning this for 0.9.

@jippi (Contributor) commented Sep 17, 2018

@preetapan, regarding "We are working on redesigning this for 0.9" - can you please include the alloc ID, or some other easy-to-compute way to match the service ID to an allocation? It's been a huge pain in the butt for hashi-ui to try to reverse-engineer the hashing done within nomad for those consul service IDs. Even just blindly appending the full alloc ID to the hash would be nice.

@mildred (Contributor) commented Nov 9, 2018

As you are redesigning this, perhaps it would be a good time to implement #3636: an option to enable automatic promotion of canaries when they are all healthy

@mildred (Contributor) commented Nov 23, 2018

@preetapan do you have any news about the plan for this issue for Nomad 0.9? Is it planned for 0.9.0 or for a later release in the 0.9.x branch?

@byronwolfman (Contributor Author)

Any update on this? The bug has existed for 8 months now (reported 6 months ago). I know it's planned to be fixed in 0.9.x but that release seems to keep getting delayed, and the massive scope of 0.9.0 frankly makes us wary of upgrading. The discussion in this thread indicates that the bug will not be fixed in 0.9.0, but in a later point release (so 0.9.1 at earliest), and then we're going to need to wait for at least the next point release to make sure that there were no show-stoppers.

I feel like by the time we can actually start using the blue-green pattern again, a full year will have passed. In the meanwhile, https://www.nomadproject.io/guides/operating-a-job/update-strategies/blue-green-and-canary-deployments.html continues to advertise an update strategy that does not actually work.

Is it not possible to backport something -- anything -- into 0.8.x? Is it possible to just have a configuration option to disable canary tagging in consul, since this is the source of the bug?

@byronwolfman (Contributor Author) commented Feb 18, 2019

So, I probably should have done this a lot sooner. I've put together a quick and dirty patch against v0.8.4 that disables canary tagging. I've erred on the side of making as few code changes as possible since the PR that introduced canary tags (#4259) is massive. This means that rather than a comprehensive cleanup effort that includes refactors and test coverage (aka the correct approach), you're getting the equivalent of a handful of commented-out lines.


This modifies the service.Hash() function; it still accepts the canary argument but now ignores it. This means that consul service IDs are consistent and unchanging when a job is promoted from canary:

diff --git a/nomad/structs/structs.go b/nomad/structs/structs.go
index 969f11338..49efad7fe 100644
--- a/nomad/structs/structs.go
+++ b/nomad/structs/structs.go
@@ -3930,11 +3930,6 @@ func (s *Service) Hash(allocID, taskName string, canary bool) string {
 		io.WriteString(h, tag)
 	}

-	// Vary ID on whether or not CanaryTags will be used
-	if canary {
-		h.Write([]byte("Canary"))
-	}
-
 	// Base32 is used for encoding the hash as sha1 hashes can always be
 	// encoded without padding, only 4 bytes larger than base64, and saves
 	// 8 bytes vs hex. Since these hashes are used in Consul URLs it's nice

The next diff modifies ServiceClient.serviceRegs() to no longer add the canary tag to consul services. Maybe this change isn't necessary, and Nomad is capable of updating the service in the consul catalog to add/remove the canary tag without having to change the service ID, but I wonder if that might introduce subtle unexpected behaviour, so I'm just taking a sledgehammer to it for now.

The change is meant to revert the commit that introduced them, comments included.

diff --git a/command/agent/consul/client.go b/command/agent/consul/client.go
index 7ca1fc845..3a0d6c7f3 100644
--- a/command/agent/consul/client.go
+++ b/command/agent/consul/client.go
@@ -641,24 +641,17 @@ func (c *ServiceClient) serviceRegs(ops *operations, service *structs.Service, t
 		return nil, fmt.Errorf("unable to get address for service %q: %v", service.Name, err)
 	}

-	// Determine whether to use tags or canary_tags
-	var tags []string
-	if task.Canary && len(service.CanaryTags) > 0 {
-		tags = make([]string, len(service.CanaryTags))
-		copy(tags, service.CanaryTags)
-	} else {
-		tags = make([]string, len(service.Tags))
-		copy(tags, service.Tags)
-	}
-
 	// Build the Consul Service registration request
 	serviceReg := &api.AgentServiceRegistration{
 		ID:      id,
 		Name:    service.Name,
-		Tags:    tags,
+		Tags:    make([]string, len(service.Tags)),
 		Address: ip,
 		Port:    port,
 	}
+	// copy isn't strictly necessary but can avoid bugs especially
+	// with tests that may reuse Tasks
+	copy(serviceReg.Tags, service.Tags)
 	ops.regServices = append(ops.regServices, serviceReg)

 	// Build the check registrations

I've been playing around with this in a sandbox environment and so far it's behaving as expected, so we might fire it into our test cluster later this week.

If you want to try it out in your own test environment, this repo provides all the tools you need to build your own binary:

  1. Clone this repo
  2. Checkout the tagged release with git checkout v0.8.4
  3. Apply the above diff/patches
  4. Consider also setting a value for VersionPrerelease and VersionMetadata in version/version.go.
  5. Launch vagrant with vagrant up
  6. Inside vagrant, build nomad with make pkg/linux_amd64/nomad (or make help if you need to target a different platform) and retrieve the binary from ${repo}/pkg/${target}/nomad

A few other things in no particular order:

  1. This will cause tests to fail, since test coverage includes making sure that canary tags work.
  2. The diffs are against 0.8.4 in order to have as few changes as possible; maybe they'll work against later 0.8.x releases but I haven't checked.
  3. Use at your own risk, especially if you launch this into a production or production-like cluster. I accept no responsibility for any cluster meltdowns that might result from running code based on my gross misunderstanding of how much work it takes to properly fix this sort of thing.

@byronwolfman (Contributor Author)

Hey friends and fellow nomad operators, good(ish) news: after soaking the patch in our test environments for a week, we've rolled this to prod and have our safe blue/greens back.

I can't recommend this as a general fix for everyone running a >= 0.8.4 cluster, but if you have a very strong understanding of how your cluster operates and what kinds of jobs are being sent to it, you're not using the canary tags feature, and you have a very robust way of testing and validating this kind of patch, well... it works. Use at your own risk of course, since the "fix" is that it disables code that it hopes you won't try to use.

If you're not in a hurry though, then it's probably still best to wait for an official (and correct) fix in 0.9.x.

@deanmaniatis

bump

@schmichael (Member)

Sorry for the lack of communication on this bug. It will not make it into 0.9.0 but is one of my highest priorities after release.

@byronwolfman (Contributor Author)

🎉 🎉 🎉 🎉

@fat0 commented May 18, 2019

As far as I can tell, this issue has not been fixed in Nomad 0.9.1 (Consul 1.3.1).

Consul health status with 2 containers + 1 canary running, before promote:

curl -sX GET "https://${CONSUL_HTTP_ADDR}/v1/health/checks/${JOB_NAME}" | jq -r '.[] | {"ServiceID": .ServiceID, "Status": .Status}'
{
  "ServiceID": "_nomad-task-hgyyhb4l35rsechp3fuidlgf4adjepi7",    << old container
  "Status": "passing"
}
{
  "ServiceID": "_nomad-task-uyxxandbftbl3kwcmrcexm2th72idfj7",    << old container
  "Status": "passing"
}
{
  "ServiceID": "_nomad-task-pd5kaqu6fg27nu26kt3lbnctc7k523xs",    << canary
  "Status": "passing"
}

Consul health status after promote, all previous service definitions have been deregistered:

curl -sX GET "https://${CONSUL_HTTP_ADDR}/v1/health/checks/${JOB_NAME}" | jq -r '.[] | {"ServiceID": .ServiceID, "Status": .Status}'
{
  "ServiceID": "_nomad-task-jiiw6hyq5jwi2lgoczvy3t64l5knqhtx",    << new container
  "Status": "critical"
}

@byronwolfman (Contributor Author)

Hey! It seems that 0.9.1 was put out early as a bug fix release, but the fix should be in 0.9.2 along with some other really neat lifecycle stuff: 4ddb4bb#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR26

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022