
Consul Connect enabled jobs fail if using health check #7709

Closed
spuder opened this issue Apr 13, 2020 · 8 comments

Labels
theme/consul/connect Consul Connect integration · theme/docs Documentation issues and enhancements · type/enhancement

@spuder
Contributor

spuder commented Apr 13, 2020

Nomad version

Nomad = 0.11.0
Consul = 1.7.2
ACLs = Enabled
Envoy = 1.13

Issue

Consul Connect enabled jobs fail to connect through Envoy if a health check is defined on the service (even when the health check passes). Connect enabled jobs work as expected if no health check is defined.

Consul Connect support in Nomad is new, and others have reported similar trouble.

Setup

I have a Connect enabled job in Nomad named bar, and a legacy VM called foo running Ubuntu 18.04 with Envoy installed.

foo:14002 (VM) -> foo-sidecar-proxy (Envoy) -> bar-sidecar-proxy (Nomad) -> bar:3000 (Nomad)

The VM running foo has the following in /etc/consul/service_foobar.json:

{
  "service": {
    "checks": [],
    "connect": {
      "sidecar_service": {
        "proxy": {
          "upstreams": [
            {
              "destination_name": "bar",
              "local_bind_port": 14002
            },
            {
              "destination_name": "count-api",
              "local_bind_port": 15002
            }
          ]
        }
      }
    },
    "enable_tag_override": false,
    "id": "foo",
    "name": "foo",
    "tags": []
  }
}

Envoy has been started with the following command

/usr/local/bin/consul connect envoy --sidecar-for foo -admin-bind localhost:19000
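(As a quick sanity check, not part of the original report: the Envoy admin API bound above on localhost:19000 can show whether the sidecar has built clusters for its upstreams. /clusters is a standard Envoy admin endpoint; grepping for the destination name is just an illustration.)

curl -s localhost:19000/clusters | grep -i bar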

The following Nomad job works correctly (note that it does not have a health check). The VM foo is able to communicate with the Nomad job running on port 3000 through Envoy:

  group "group" {
    count = <%= @roam['variables']['count'] %>

    # Allow all containers in a group to share a private loopback interface
    network {
      mode = "bridge"
    }

    service { # Register this task in Consul and define health checks
      name = "bar"
      port = "3000"
      connect {
        sidecar_service {}
      }
    }
  }

From the VM foo, the service is reachable through the mesh:

ssh foo
curl localhost:14002/actuator/health
{
  "groups": [],
  "status": {
    "code": "UP",
    "description": ""
  }
}

I'm not sure if this is a bug or a documentation issue. Here are all the configurations I have tried:


Attempt 1: This works and is able to communicate over the Envoy proxy service mesh, but there is no health check.

  group "group" {
    count = <%= @roam['variables']['count'] %>

    # Allow all containers in a group to share a private loopback interface
    network {
      mode = "bridge"
    }

    service { # Register this task in Consul and define health checks
      name = "bar"
      port = "3000"
      connect {
        sidecar_service {}
      }
    }
  }

curl localhost:14002/actuator/health
{
  "groups": [],
  "status": {
    "code": "UP",
    "description": ""
  }
}

Result: ✅

  • Job starts
  • Health check passes
  • Envoy is able to connect
Attempt 2: Add an http health check on the named port "http", with address_mode = "driver" on the check.
    network {
      mode = "bridge"
      port "http" {
        to = "3000"
      }
    }
    service {
      name = "bar"
      port = "http"
      check {
        port = "http"
        type = "http"
        path = "/actuator/health"
        interval = "10s"
        timeout = "2s"
        address_mode = "driver"
      }
      connect {
        sidecar_service {}
      }
    }
curl localhost:14002/actuator/health
curl: (56) Recv failure: Connection reset by peer

Result: ❌

  • Job starts
  • Health check passes
  • Envoy is not able to connect
Attempt 3: Same as Attempt 2, plus address_mode = "driver" on the service itself.
    network {
      mode = "bridge"
      port "http" {
        to = "3000"
      }
    }
    service {
      name = "bar"
      port = "http"
      address_mode = "driver"
      check {
        port = "http"
        type = "http"
        path = "/actuator/health"
        interval = "10s"
        timeout = "2s"
        address_mode = "driver"
      }
      connect {
        sidecar_service {}
      }
    }
curl localhost:14002/actuator/health
curl: (56) Recv failure: Connection reset by peer

Result: ❌

  • Job starts
  • Health check passes
  • Envoy is not able to connect
Attempt 4: Same as Attempt 3, but with the check path changed to "/" and the interval to 5s.

    network {
      mode = "bridge"
      port "http" {
        to = "3000"
      }
    }
    service {
      name = "bar"
      port = "http"
      address_mode = "driver"
      check {
        port = "http"
        type = "http"
        path = "/"
        interval = "5s"
        timeout = "2s"
        address_mode = "driver"
      }
      connect {
        sidecar_service {}
      }
    }
curl localhost:14002/actuator/health
curl: (56) Recv failure: Connection reset by peer

Result: ❌

  • Job starts
  • Health check does not pass
  • Envoy is not able to connect
Attempt 5: Set the "http" port's to = -1.

  group "group" {
    count = 1
    network {
      mode = "bridge"
      port "http" {
        to = -1
      }
    }
    service {
      name = "bar"
      port = "3000"
      check {
        port = "http"
        type = "http"
        path = "/actuator/health"
        interval = "5s"
        timeout = "2s"
      }
      connect {
        sidecar_service {}
      }
    }
  }

Result: ❌

  • Job starts
  • Health check passes
  • Envoy is not able to connect
Attempt 6: Use expose = true as mentioned in #7556.

    network {
      mode = "bridge"
      port "http" {
        to = -1
      }
    }
    service {
      name = "bar"
      port = "3000"
      check {
        port = "http"
        type = "http"
        path = "/actuator/health"
        expose = true
        interval = "5s"
        timeout = "2s"
      }
      connect {
        sidecar_service {}
      }
    }

Result: ❌

  • Job starts
  • Health check passes
  • Envoy is not able to connect
@shoenig
Member

shoenig commented Apr 13, 2020

Hey @spuder thanks for reporting, and sorry you're having trouble with this.

Rather than us trying to debug from bits of your configuration, do you mind starting from some known-good examples and working backwards to figure out what's going wrong? The configuration below works with v0.11.0 for me. If this baseline example doesn't work, can you provide logs from the Nomad and Consul agents with log_level=DEBUG?
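For reference, a minimal sketch of turning on DEBUG logging for both agents (assuming HCL agent config files at illustrative paths; the -log-level flag on each agent's command line is equivalent):

# e.g. /etc/nomad.d/nomad.hcl
log_level = "DEBUG"

# e.g. /etc/consul.d/consul.hcl
log_level = "DEBUG"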

# example.nomad

job "example" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }

      check {
        name     = "api-health"
        type     = "http"
        port     = "healthcheck"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
        expose   = true
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpnomad/counter-api:v1"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }
          }
        }
      }

      check {
        name     = "dashboard-health"
        type     = "http"
        port     = "healthcheck"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
        expose   = true
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v1"
      }
    }
  }
}

Running:

$ consul agent -dev
$ sudo nomad agent -dev-connect
$ nomad job run example.nomad

Checking Nomad:

$ nomad job status example
ID            = example
Name          = example
Submit Date   = 2020-04-13T16:41:44-06:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
api         0       0         1        0       0         0
dashboard   0       0         1        0       0         0

Latest Deployment
ID          = 9c22115e
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
api         1        1       1        0          2020-04-13T16:51:58-06:00
dashboard   1        1       1        0          2020-04-13T16:52:04-06:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
e0f7ed62  c5839b0b  api         0        run      running  27s ago  14s ago
e7b3cd01  c5839b0b  dashboard   0        run      running  27s ago  7s ago

Checking Consul:

$ curl -s localhost:8500/v1/agent/checks | jq '.[] | select(.Name=="dashboard-health")'
{
  "Node": "NUC10",
  "CheckID": "_nomad-check-5794a0c4287f9d66c4a5450586f7410b33a6bd3f",
  "Name": "dashboard-health",
  "Status": "passing",
  "Notes": "",
  "Output": "HTTP GET http://192.168.1.53:25646/health: 200 OK Output: Hello, you've hit /health\n",
  "ServiceID": "_nomad-task-e7b3cd01-3d24-a3f1-7841-ad897586fe0f-group-dashboard-count-dashboard-9002",
  "ServiceName": "count-dashboard",
  "ServiceTags": [],
  "Type": "http",
  "Definition": {},
  "CreateIndex": 0,
  "ModifyIndex": 0
}
$ curl -s localhost:8500/v1/agent/checks | jq '.[] | select(.Name=="api-health")'
{
  "Node": "NUC10",
  "CheckID": "_nomad-check-aab24708f3160bd44748d8b8f0a85b8c6e5ceb16",
  "Name": "api-health",
  "Status": "passing",
  "Notes": "",
  "Output": "HTTP GET http://192.168.1.53:21128/health: 200 OK Output: Hello, you've hit /health\n",
  "ServiceID": "_nomad-task-e0f7ed62-a523-0544-75ca-2a41402a2c93-group-api-count-api-9001",
  "ServiceName": "count-api",
  "ServiceTags": [],
  "Type": "http",
  "Definition": {},
  "CreateIndex": 0,
  "ModifyIndex": 0
}

Checking the dashboard:

$ curl -s -w '%{response_code}\n' localhost:9002 -o /dev/null
200

@shoenig
Member

shoenig commented Apr 13, 2020

Likewise, I get similarly successful results using the underlying proxy.expose plumbing instead of the shortcut check.expose parameter used above.

job "example" {
  datacenters = ["dc1"]

  group "api" {
    network {
      mode = "bridge"

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {
          proxy {
            expose {
              path {
                path            = "/health"
                protocol        = "http"
                local_path_port = 9001
                listener_port   = "healthcheck"
              }
            }
          }
        }
      }

      check {
        name     = "api-health"
        type     = "http"
        port     = "healthcheck"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorpnomad/counter-api:v1"
      }
    }
  }

  group "dashboard" {
    network {
      mode = "bridge"

      port "http" {
        static = 9002
        to     = 9002
      }

      port "healthcheck" {
        to = -1
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port  = 8080
            }

            expose {
              path {
                path            = "/health"
                protocol        = "http"
                local_path_port = 9002
                listener_port   = "healthcheck"
              }
            }
          }
        }
      }

      check {
        name     = "dashboard-health"
        type     = "http"
        port     = "healthcheck"
        path     = "/health"
        interval = "10s"
        timeout  = "3s"
      }
    }

    task "dashboard" {
      driver = "docker"

      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }

      config {
        image = "hashicorpnomad/counter-dashboard:v1"
      }
    }
  }
}

@spuder
Contributor Author

spuder commented Apr 14, 2020

I figured it out. The name attribute is required on both the service and the check. If you only set the name on one or the other, Consul Connect will fail without any errors.

    network {
      mode = "bridge"
      port "http" {
        to = -1
      }
    }
    service {
      name = "foo"      # <- Named service 1
      port = "3000"
      check {
        name = "foo-health"      # <- Named service 2
        port = "http"
        type = "http"
        path = "/actuator/health"
        interval = "5s"
        timeout = "2s"
        expose   = true
      }
      connect {
        sidecar_service {}
      }
    }

Possible remediations

  1. Make 'name' a required attribute when using expose = true
  2. Document that health checks must have a unique name.

@shoenig
Member

shoenig commented May 28, 2020

Make 'name' a required attribute when using expose = true
Document that health checks must have a unique name.

Both of these sound like good suggestions, @spuder

@jharley

jharley commented Sep 5, 2020

Possibly related to #7221: if the service stanza is using a named port (e.g. port = "http" and not port = 5000) it will generate an error: error in job mutator expose-check: unable to determine local service port for service check.
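To illustrate the two patterns being contrasted (a sketch, not taken from the comment; the check path and the values "bar" and 5000 are placeholders):

# Produces "error in job mutator expose-check: unable to determine local service port for service check"
service {
  name = "bar"
  port = "http"          # named network port label
  connect {
    sidecar_service {}
  }
  check {
    expose   = true
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}

# Works: a numeric port value
service {
  name = "bar"
  port = "5000"
  connect {
    sidecar_service {}
  }
  check {
    expose   = true
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
  }
}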

@zhenik

zhenik commented Sep 29, 2020

Hi, I use this example. Differences: there is no need to register the dynamic port with to = -1 (only the static port for the UI), and there is no port stanza under check.

job "countdash" {
  datacenters = ["dc1"]
  group "api" {
    network {
      mode = "bridge"
    }

    service {
      name = "count-api"
      port = "9001"

      connect {
        sidecar_service {}
      }
      check {
        expose   = true
        name     = "api-alive"
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "web" {
      driver = "docker"
      config {
        image = "hashicorpnomad/counter-api:v1"
      }
    }
  }

  group "dashboard" {
    network {
      mode ="bridge"
      port "http" {
        static = 9002
        to     = 9002
      }
    }

    service {
      name = "count-dashboard"
      port = "9002"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "count-api"
              local_bind_port = 8080
            }
          }
        }
      }
      check {
        expose   = true
        name     = "dashboard-alive"
        type     = "http"
        path     = "/health"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "dashboard" {
      driver = "docker"
      env {
        COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
      }
      config {
        image = "hashicorpnomad/counter-dashboard:v1"
      }
    }
  }
}

Another example, with MinIO:

job "minio" {

  type          = "service"
  datacenters   = ["dc1"]
  namespace     = "default"

  group "s3" {
    network {
      mode = "bridge"
    }
    service {
      name = "minio"
      port = 9000
      # https://docs.min.io/docs/minio-monitoring-guide.html
      check {
        expose    = true
        name      = "minio-live"
        type      = "http"
        path      = "/minio/health/live"
        interval  = "10s"
        timeout   = "2s"
      }
      check {
        expose    = true
        name      = "minio-ready"
        type      = "http"
        path      = "/minio/health/ready"
        interval  = "15s"
        timeout   = "4s"
      }
      connect {
        sidecar_service {
        }
      }
    }

    task "server" {
      driver = "docker"

      config {
        image             = "minio/minio:latest"
        memory_hard_limit = 2048
        args              = [
          "server",
          "/local/data",
          "-address",
          "127.0.0.1:9000"
        ]
      }
      resources {
        cpu     = 200
        memory  = 1024
      }
    }
  }
}

@shoenig
Member

shoenig commented Jun 17, 2021

Make 'name' a required attribute when using expose = true
Document that health checks must have a unique name.

These aren't required anymore, I think in recent versions of Consul

Possibly related to #7221: if the service stanza is using a named port (e.g. port = "http" and not port = 5000) it will generate an error: error in job mutator expose-check: unable to determine local service port for service check.

Using a network port label for a service port that will be fronted by a Connect sidecar is probably not what you intended - the service.port value in this case is informing Envoy of the local port your service is going to bind to (inside the network namespace). Unlike with normal services, it is not used for service discovery, and should not be referenced by anything other than the internal Connect plumbing, making the value of a port label here dubious*. I'd like to better document this in #10677.

[*] you could do something like

port "api" {
  static = 9001
  to = 9001
}

and reference the api port label, but then you have a hole in your service mesh.
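(For contrast, the working examples earlier in the thread simply set the service port to the numeric port the task binds to inside its network namespace, e.g.:)

service {
  name = "count-api"
  port = "9001"   # local bind port inside the allocation's network namespace
  connect {
    sidecar_service {}
  }
}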

shoenig closed this as completed Jun 17, 2021
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022