
Services are registered without health check #7736

Closed
madsholden opened this issue Apr 17, 2020 · 5 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/duplicate, theme/consul

Comments


madsholden commented Apr 17, 2020

Nomad version

0.10.4

Operating system and Environment details

Ubuntu 18.04.4

Issue

We have been experiencing some slight downtime on redeployments. It seems to be caused by Nomad registering new services in Consul in two steps: first the service itself, then its health checks. This causes our load balancer (Traefik) to pick up the new set of instances right away, and then remove them again when the health check is added (in critical state). This happens very quickly, so the services are only registered in Traefik for a split second, but it is enough to lose some requests.

Reproduction steps

  • Deploy a service with a health check defined.
  • Continuously call the Consul API endpoint /v1/health/service/:service (a polling sketch follows these steps).
    For a split second you will see the "Checks" list contain only the "serfHealth" check.
    The next call to the API will show two items in the "Checks" list, the new one being the check defined by the job.
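
A minimal polling sketch of the second step, assuming the official github.com/hashicorp/consul/api Go client and a Consul agent on its default local address; the service name "demo-webapp" matches the job file below. It prints a line whenever an instance is registered without any service-level check:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	for {
		// passingOnly=false so entries with critical checks are also returned.
		entries, _, err := client.Health().Service("demo-webapp", "", false, nil)
		if err != nil {
			log.Println(err)
		} else {
			for _, e := range entries {
				hasServiceCheck := false
				for _, c := range e.Checks {
					// Node-level checks such as serfHealth have an empty ServiceID.
					if c.ServiceID != "" {
						hasServiceCheck = true
					}
				}
				if !hasServiceCheck {
					fmt.Printf("%s on %s has no service-level check yet\n",
						e.Service.ID, e.Node.Node)
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
}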

The issue is reproducible with this job file:

job "demo-webapp" {
  datacenters = [ "eu-west-1a", "eu-west-1b", "eu-west-1c" ]

  type = "service"

  update {
    max_parallel     = 2
    canary           = 2
    auto_revert      = true
    auto_promote     = true
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "2m"
  }

  group "demo-webapp" {
    count = 2

    task "demo-webapp" {
      driver = "docker"

      config {
        image = "hashicorp/demo-webapp-lb-guide"
      }

      env {
        PORT    = "${NOMAD_PORT_http}"
        NODE_IP = "${NOMAD_IP_http}"
      }

      service {
        name = "demo-webapp"
        port = "http"

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        network {
          port "http" {}
        }
      }
    }
  }
}

Here are two consecutive calls to /v1/health/service/demo-webapp during a redeployment of the job above. From the first response the service looks healthy, but only because its health check has not been registered yet.

{
    "Node": {
      "ID": "a3c963c0-8a2a-3e42-c602-5a91de74b2cf",
      "Node": "ip-10-48-12-132",
      "Address": "10.48.12.132",
      "Datacenter": "eu-west-1",
      "TaggedAddresses": {
        "lan": "10.48.12.132",
        "wan": "10.48.12.132"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 62890176,
      "ModifyIndex": 62890178
    },
    "Service": {
      "ID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
      "Service": "demo-webapp",
      "Tags": [],
      "Address": "10.48.12.132",
      "Port": 30721,
      "EnableTagOverride": false,
      "CreateIndex": 66244524,
      "ModifyIndex": 66244524
    },
    "Checks": [
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 62890176,
        "ModifyIndex": 62890176
      }
    ]
  }
{
    "Node": {
      "ID": "a3c963c0-8a2a-3e42-c602-5a91de74b2cf",
      "Node": "ip-10-48-12-132",
      "Address": "10.48.12.132",
      "Datacenter": "eu-west-1",
      "TaggedAddresses": {
        "lan": "10.48.12.132",
        "wan": "10.48.12.132"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 62890176,
      "ModifyIndex": 62890178
    },
    "Service": {
      "ID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
      "Service": "demo-webapp",
      "Tags": [],
      "Address": "10.48.12.132",
      "Port": 30721,
      "EnableTagOverride": false,
      "CreateIndex": 66244524,
      "ModifyIndex": 66244524
    },
    "Checks": [
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 62890176,
        "ModifyIndex": 62890176
      },
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "_nomad-check-c62db7320a864a7e27345a2ce315642a81265673",
        "Name": "service: \"demo-webapp\" check",
        "Status": "critical",
        "Notes": "",
        "Output": "",
        "ServiceID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
        "ServiceName": "demo-webapp",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 66244539,
        "ModifyIndex": 66244539
      }
    ]
  }
spuder (Contributor) commented Apr 19, 2020

It looks like the health check is inheriting its name from the service. Do you have the same problem if you give the check a unique name?

      service {
        name = "demo-webapp"
        port = "http"

        check {
          name     = "demo-webapp-healthcheck" # <= define the name here
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

Possibly related? #7709

madsholden (Author)

Thank you for the tip. I tried it, but unfortunately it did not make a difference.

madsholden (Author)

I just noticed this looks like the exact same problem as #3935. Any update on this, @preetapan?

It is blocking us from using Nomad in our production environment; we can't afford to lose requests every time we deploy.

tgross added the stage/accepted label and removed the stage/needs-investigation label on Jan 26, 2021
tgross (Member) commented Jan 26, 2021

Hi folks, I'm going to close this issue as a dupe of #3935 in the interest of helping us surface some of the older papercuts we need to get fixed.
