
Services are registered without health check #7736

Closed
madsholden opened this issue Apr 17, 2020 · 5 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), stage/duplicate, theme/consul

Comments


madsholden commented Apr 17, 2020

Nomad version

0.10.4

Operating system and Environment details

Ubuntu 18.04.4

Issue

We have been experiencing some slight downtime on redeployments. It seems to be caused by Nomad registering new services in Consul in two steps: first the service itself, then its health checks. This causes our load balancer (Traefik) to pick up the new set of instances right away, and then remove them again when the health check is added (in critical state). This happens very quickly, so the services are only registered in Traefik for a split second, but it is enough to lose some requests.

Reproduction steps

  • Deploy a service with a health check defined.
  • Continuously call the Consul API endpoint /v1/health/service/:service (a polling sketch follows these steps).
    For a split second you will see the "Checks" list contain only the "serfHealth" check.
    The next call to the API will show two items in the "Checks" list, the new one being the check defined by the job.
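
A minimal polling sketch of the second step, assuming the official github.com/hashicorp/consul/api Go client and a Consul agent on its default local address; the service name "demo-webapp" matches the job file below. It prints a line whenever an instance is registered without any service-level check:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	for {
		// passingOnly=false so entries with critical checks are also returned.
		entries, _, err := client.Health().Service("demo-webapp", "", false, nil)
		if err != nil {
			log.Println(err)
		} else {
			for _, e := range entries {
				hasServiceCheck := false
				for _, c := range e.Checks {
					// Node-level checks such as serfHealth have an empty ServiceID.
					if c.ServiceID != "" {
						hasServiceCheck = true
					}
				}
				if !hasServiceCheck {
					fmt.Printf("%s on %s has no service-level check yet\n",
						e.Service.ID, e.Node.Node)
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
}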

The issue is reproducible with this job file:

job "demo-webapp" {
  datacenters = [ "eu-west-1a", "eu-west-1b", "eu-west-1c" ]

  type = "service"

  update {
    max_parallel     = 2
    canary           = 2
    auto_revert      = true
    auto_promote     = true
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "2m"
  }

  group "demo-webapp" {
    count = 2

    task "demo-webapp" {
      driver = "docker"

      config {
        image = "hashicorp/demo-webapp-lb-guide"
      }

      env {
        PORT    = "${NOMAD_PORT_http}"
        NODE_IP = "${NOMAD_IP_http}"
      }

      service {
        name = "demo-webapp"
        port = "http"

        check {
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        network {
          port "http" {}
        }
      }
    }
  }
}

Here are two consecutive calls to /v1/health/service/demo-webapp during a redeployment of the job above. From the first response the service looks healthy, but only because its health check has not been registered yet.

{
    "Node": {
      "ID": "a3c963c0-8a2a-3e42-c602-5a91de74b2cf",
      "Node": "ip-10-48-12-132",
      "Address": "10.48.12.132",
      "Datacenter": "eu-west-1",
      "TaggedAddresses": {
        "lan": "10.48.12.132",
        "wan": "10.48.12.132"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 62890176,
      "ModifyIndex": 62890178
    },
    "Service": {
      "ID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
      "Service": "demo-webapp",
      "Tags": [],
      "Address": "10.48.12.132",
      "Port": 30721,
      "EnableTagOverride": false,
      "CreateIndex": 66244524,
      "ModifyIndex": 66244524
    },
    "Checks": [
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 62890176,
        "ModifyIndex": 62890176
      }
    ]
  }
{
    "Node": {
      "ID": "a3c963c0-8a2a-3e42-c602-5a91de74b2cf",
      "Node": "ip-10-48-12-132",
      "Address": "10.48.12.132",
      "Datacenter": "eu-west-1",
      "TaggedAddresses": {
        "lan": "10.48.12.132",
        "wan": "10.48.12.132"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 62890176,
      "ModifyIndex": 62890178
    },
    "Service": {
      "ID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
      "Service": "demo-webapp",
      "Tags": [],
      "Address": "10.48.12.132",
      "Port": 30721,
      "EnableTagOverride": false,
      "CreateIndex": 66244524,
      "ModifyIndex": 66244524
    },
    "Checks": [
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 62890176,
        "ModifyIndex": 62890176
      },
      {
        "Node": "ip-10-48-12-132",
        "CheckID": "_nomad-check-c62db7320a864a7e27345a2ce315642a81265673",
        "Name": "service: \"demo-webapp\" check",
        "Status": "critical",
        "Notes": "",
        "Output": "",
        "ServiceID": "_nomad-task-da0566c3-60c6-147d-266c-2841a27c97ac-demo-webapp-demo-webapp-http",
        "ServiceName": "demo-webapp",
        "ServiceTags": [],
        "Definition": {},
        "CreateIndex": 66244539,
        "ModifyIndex": 66244539
      }
    ]
  }
spuder (Contributor) commented Apr 19, 2020

It looks like the health check is inheriting its name from the service. Do you have the same problem if you give the check a unique name?

      service {
        name = "demo-webapp"
        port = "http"

        check {
          name     = "demo-webapp-healthcheck" # <= define the name here
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

Possibly related? #7709

madsholden (Author)

Thank you for the tip. I tried it, but unfortunately it did not make a difference.

madsholden (Author)

I just noticed this looks like the exact same problem as #3935. Any update on this, @preetapan?

It is blocking us from using Nomad in our production environment; we can't afford to lose requests every time we deploy.

tgross added the stage/accepted label and removed the stage/needs-investigation label on Jan 26, 2021
tgross (Member) commented Jan 26, 2021

Hi folks, I'm going to close this issue as a dupe of #3935 in the interest of helping us surface some of the older papercuts we need to get fixed.
