[Question] Health Check status not updating after service status turns critical #3261

Closed
Btlyons1 opened this issue Sep 22, 2017 · 3 comments

@Btlyons1

Nomad and Consul versions

Nomad 0.6.3
Consul v0.8.1

Operating system and Environment details

Running a local development environment for our internal PaaS.
Virtual machine: BusyBox v1.24.2 (via docker-machine)
docker-compose 1.7.1
Nomad is running as a container with an alpine:latest base image.

Issue

The health status is not updating even though the service status in Consul is critical.
The container that the job brings up has an endpoint for changing its HTTP response code.
The initial response status is 200.
When I change the response from 200 to 500 via a POST to that endpoint, I would expect the health status to change to unhealthy.
We are looking to add a feature where, when scaling down, unhealthy allocations are replaced first instead of healthy ones; a sketch of detecting such allocations follows.
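For that scale-down feature, a minimal sketch of finding the unhealthy allocations directly from Consul (assuming the global-hc-check service name from the job file below; the ServiceID embeds the allocation ID, so results can be mapped back to Nomad allocations):

# Sketch: list the ServiceIDs of critical checks for this service,
# using Consul's /v1/health/checks/:service endpoint.
curl -s localhost:8500/v1/health/checks/global-hc-check \
  | jq -r '.[] | select(.Status == "critical") | .ServiceID'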

Reproduction steps

I run the job defined in the job file section below.

nomad run example.nomad

Check the status to ensure it is healthy.

/ # nomad status example
ID            = example
Name          = example
Submit Date   = 09/22/17 15:07:29 UTC
Type          = service
Priority      = 50
Datacenters   = dev
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 90a455e5
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       1        1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
9d134111  c6bee743  cache       2        run      running   09/22/17 15:07:29 UTC

Check Consul

curl localhost:8500/v1/health/state/any | jq .

  {
    "Node": "quadradev",
    "CheckID": "756f32c0e9082952ff8b82edb374bb166b06d9f7",
    "Name": "alive",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://10.0.2.15:26577/status: 200 OK Output: {\"Status\":\"200 OK\"}",
    "ServiceID": "_nomad-executor-9d134111-2f6e-b3e5-3575-cc9b78f0f679-testme-global-hc-check-global-cache",
    "ServiceName": "global-hc-check",
    "CreateIndex": 5198,
    "ModifyIndex": 5200
  },
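(As an aside, /v1/health/state/any returns every check in the datacenter; to narrow the output to just this service's check, a per-service query along these lines should also work:)

curl -s localhost:8500/v1/health/checks/global-hc-check | jq '.[] | {Name, Status, Output}'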

Change the status to 500

curl localhost/status
{"Status":"200 OK"}

curl -X POST localhost/status/toggle/500
{"Status":500}

Check Consul to ensure the status is critical

curl localhost:8500/v1/health/state/any | jq .

 {
    "Node": "quadradev",
    "CheckID": "756f32c0e9082952ff8b82edb374bb166b06d9f7",
    "Name": "alive",
    "Status": "critical",
    "Notes": "",
    "Output": "HTTP GET http://10.0.2.15:26577/status: 500 Internal Server Error Output: {\"Status\":500}",
    "ServiceID": "_nomad-executor-9d134111-2f6e-b3e5-3575-cc9b78f0f679-testme-global-hc-check-global-cache",
    "ServiceName": "global-hc-check",
    "CreateIndex": 5198,
    "ModifyIndex": 5242
  },

Now that the status is critical, I would expect the deployment to be unhealthy.

/ # nomad status example
ID            = example
Name          = example
Submit Date   = 09/22/17 15:07:29 UTC
Type          = service
Priority      = 50
Datacenters   = dev
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       1         0

Latest Deployment
ID          = 90a455e5
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       1        1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
9d134111  c6bee743  cache       2        run      running   09/22/17 15:07:29 UTC

Check the allocation logs

/ # nomad logs 9d134111
10.0.2.15 - - [22/Sep/2017:15:07:42 +0000] "GET /status HTTP/1.1" 200 19
10.0.2.15 - - [22/Sep/2017:15:07:52 +0000] "GET /status HTTP/1.1" 200 19
[... identical "GET /status" 200 entries every 10 seconds ...]
10.0.2.15 - - [22/Sep/2017:15:18:12 +0000] "GET /status HTTP/1.1" 200 19
10.0.2.15 - - [22/Sep/2017:15:18:22 +0000] "GET /status HTTP/1.1" 200 19
::1 - - [22/Sep/2017:15:18:29 +0000] "POST /status/toggle/500 HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:18:32 +0000] "GET /status HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:18:42 +0000] "GET /status HTTP/1.1" 500 14
[... identical "GET /status" 500 entries every 10 seconds ...]
10.0.2.15 - - [22/Sep/2017:15:24:52 +0000] "GET /status HTTP/1.1" 500 14
10.0.2.15 - - [22/Sep/2017:15:25:02 +0000] "GET /status HTTP/1.1" 500 14

Job file

job "example" {
  datacenters = ["dev"]
  type = "service"
  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    auto_revert = false
    canary = 0
  }
  group "cache" {
    count = 2
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 300
    }
    task "testme" {
      driver = "docker"
      config {
        image = "quadra/healthy"
        port_map {
          db = 80
        }
      }
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
        network {
          port "db" {}
        }
      }
      service {
        name = "global-hc-check"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "http"
          interval = "10s"
          timeout  = "2s"
          path = "/status"
        }
      }
    }
  }
}
@dadgar
Contributor

dadgar commented Sep 25, 2017

Hey @Btlyons1, the deployment object tracks the initial health of newly placed allocations and is only valid during a rolling update or canary process. This deployment has entered a terminal status and is no longer being tracked. That is the desired behavior: the deployment is used to drive a rolling update, not to track the long-term health of the allocation.
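If you want to watch the ongoing health after a deployment is terminal, one option is to poll the check state from Consul directly; a sketch (service name taken from the job file above, with a polling interval matching the check's 10s interval):

# Sketch: poll the live Consul check status every 10 seconds.
watch -n 10 "curl -s localhost:8500/v1/health/checks/global-hc-check | jq -r '.[].Status'"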

@dadgar dadgar closed this as completed Sep 25, 2017
@Btlyons1 Btlyons1 changed the title Health Check status not updating after service status turns critical [Question] Health Check status not updating after service status turns critical Sep 25, 2017
@Btlyons1
Author

Btlyons1 commented Sep 25, 2017

@dadgar Makes sense. I misunderstood the docs. Thanks for the clarification.

@github-actions

github-actions bot commented Dec 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 7, 2022