[Question] Health Check status not updating after service status turns critical #3261

Closed
Btlyons1 opened this issue Sep 22, 2017 · 3 comments

@Btlyons1

Nomad and Consul versions

Nomad 0.6.3
Consul v0.8.1

Operating system and Environment details

Running a local development environment for our internal PaaS.
Virtual machine: BusyBox v1.24.2 (via docker-machine)
docker-compose 1.7.1
Nomad is running as a container with an alpine:latest base image.

Issue

The health status is not updating even though the service status in Consul is critical.
The container that the job brings up has an endpoint for changing its HTTP response code.
The initial response status is 200.
When I change the response from 200 to 500 via a POST to that endpoint, I would expect the health status to change to unhealthy.
We are looking to add a feature where, when scaling down, unhealthy allocations are replaced first instead of healthy ones; a sketch of detecting such allocations follows.
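For that scale-down feature, a minimal sketch of finding the unhealthy allocations directly from Consul (assuming the global-hc-check service name from the job file below; the ServiceID embeds the allocation ID, so results can be mapped back to Nomad allocations):

# Sketch: list the ServiceIDs of critical checks for this service,
# using Consul's /v1/health/checks/:service endpoint.
curl -s localhost:8500/v1/health/checks/global-hc-check \
  | jq -r '.[] | select(.Status == "critical") | .ServiceID'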

Reproduction steps

I run the job defined in the job file section below.

nomad run example.nomad

Check the status to ensure it is healthy.

/ # nomad status example
ID            = example
Name          = example
Submit Date   = 09/22/17 15:07:29 UTC
Type          = service
Priority      = 50
Datacenters   = dev
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 90a455e5
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       1        1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
9d134111  c6bee743  cache       2        run      running   09/22/17 15:07:29 UTC

Check Consul

curl localhost:8500/v1/health/state/any | jq .

  {
    "Node": "quadradev",
    "CheckID": "756f32c0e9082952ff8b82edb374bb166b06d9f7",
    "Name": "alive",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://10.0.2.15:26577/status: 200 OK Output: {\"Status\":\"200 OK\"}",
    "ServiceID": "_nomad-executor-9d134111-2f6e-b3e5-3575-cc9b78f0f679-testme-global-hc-check-global-cache",
    "ServiceName": "global-hc-check",
    "CreateIndex": 5198,
    "ModifyIndex": 5200
  },
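(As an aside, /v1/health/state/any returns every check in the datacenter; to narrow the output to just this service's check, a per-service query along these lines should also work:)

curl -s localhost:8500/v1/health/checks/global-hc-check | jq '.[] | {Name, Status, Output}'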

Change the status to 500

curl localhost/status
{"Status":"200 OK"}

curl -X POST localhost/status/toggle/500
{"Status":500}

Check Consul to ensure the status is critical

curl localhost:8500/v1/health/state/any | jq .

 {
    "Node": "quadradev",
    "CheckID": "756f32c0e9082952ff8b82edb374bb166b06d9f7",
    "Name": "alive",
    "Status": "critical",
    "Notes": "",
    "Output": "HTTP GET http://10.0.2.15:26577/status: 500 Internal Server Error Output: {\"Status\":500}",
    "ServiceID": "_nomad-executor-9d134111-2f6e-b3e5-3575-cc9b78f0f679-testme-global-hc-check-global-cache",
    "ServiceName": "global-hc-check",
    "CreateIndex": 5198,
    "ModifyIndex": 5242
  },

Now that the status is critical, I would expect the deployment to be unhealthy.

/ # nomad status example
ID            = example
Name          = example
Submit Date   = 09/22/17 15:07:29 UTC
Type          = service
Priority      = 50
Datacenters   = dev
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 90a455e5
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       1        1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
9d134111  c6bee743  cache       2        run      running   09/22/17 15:07:29 UTC

Check the allocation logs

/ # nomad logs 9d134111
10.0.2.15 - - [22/Sep/2017:15:07:42 +0000] "GET /status HTTP/1.1" 200 19
10.0.2.15 - - [22/Sep/2017:15:07:52 +0000] "GET /status HTTP/1.1" 200 19
[... identical "GET /status" 200 entries every 10 seconds ...]
10.0.2.15 - - [22/Sep/2017:15:18:12 +0000] "GET /status HTTP/1.1" 200 19
10.0.2.15 - - [22/Sep/2017:15:18:22 +0000] "GET /status HTTP/1.1" 200 19
::1 - - [22/Sep/2017:15:18:29 +0000] "POST /status/toggle/500 HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:18:32 +0000] "GET /status HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:18:42 +0000] "GET /status HTTP/1.1" 500 14
[... identical "GET /status" 500 entries every 10 seconds ...]
10.0.2.15 - - [22/Sep/2017:15:24:52 +0000] "GET /status HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:25:02 +0000] "GET /status HTTP/1.1" 500 14

Job file

job "example" {
  datacenters = ["dev"]
  type = "service"
  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    auto_revert = false
    canary = 0
  }
  group "cache" {
    count = 2
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 300
    }
    task "testme" {
      driver = "docker"
      config {
        image = "quadra/healthy"
        port_map {
          db = 80
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          port "db" {}
        }
      }
      service {
        name = "global-hc-check"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "http"
          interval = "10s"
          timeout  = "2s"
          path = "/status"
        }
      }
    }
  }
}
@dadgar
Contributor

dadgar commented Sep 25, 2017

Hey @Btlyons1, the deployment object tracks the initial health of newly placed allocations and is only valid during a rolling update or canary process. This deployment has entered a terminal status and is no longer being tracked. That is the desired behavior: the deployment is used to drive a rolling update, not to track the long-term health of the allocation.
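If you want to watch the ongoing health after a deployment is terminal, one option is to poll the check state from Consul directly; a sketch (service name taken from the job file above, with a polling interval matching the check's 10s interval):

# Sketch: poll the live Consul check status every 10 seconds.
watch -n 10 "curl -s localhost:8500/v1/health/checks/global-hc-check | jq -r '.[].Status'"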

@dadgar dadgar closed this as completed Sep 25, 2017
@Btlyons1 Btlyons1 changed the title Health Check status not updating after service status turns critical [Question] Health Check status not updating after service status turns critical Sep 25, 2017
@Btlyons1
Author

Btlyons1 commented Sep 25, 2017

@dadgar Makes sense. I misunderstood the docs. Thanks for the clarification.

@github-actions

github-actions bot commented Dec 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2022