[Feature Request] data.consul_service using health endpoint #87

Closed

eedwards-sk opened this issue Jan 17, 2019 · 25 comments

@eedwards-sk

Currently, data.consul_service is of almost no use to me: as I discovered, it reads from the catalog and does not take health check results into account.

Ideally, I'd like data.consul_service to return the healthy addresses for a service. I'm not sure what use cases people have for fetching unhealthy results, but I don't have one.

I need to configure Terraform's Vault provider with the address of a healthy Vault instance. With the current data.consul_service behavior, I end up getting back the addresses of unhealthy nodes or nodes that have left the cluster.

@remilapeyre
Collaborator

Hi @eedwards-sk, can you give more information about your use-case?

It is true that a new data source may be needed to fetch health information about a given service, but perhaps you can use Consul's DNS interface to achieve the same goal.
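For reference, a minimal sketch of that DNS lookup, assuming a local agent listening on the default DNS port 8600 (the service.consul domain only resolves instances whose health checks pass):

# Ask the local Consul agent's DNS interface for the healthy "vault" instances
dig @127.0.0.1 -p 8600 vault.service.consul SRV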

@eedwards-sk
Author

eedwards-sk commented Jan 17, 2019

Hi @remilapeyre, thanks for the response.

Unfortunately I cannot use dns, and I tried many, many ways to make that work first (I know more about iptables and resolv.conf now than I ever cared to).

I'm doing orchestration by running Terraform as Concourse CI tasks, and inside containers you cannot override resolvers on Alpine, for example (thanks to how its resolver queries all DNS servers regardless of order). Most Docker hosts (or, in my case, runc with Concourse) also usually control resolv.conf, so it's not always possible to redirect DNS to a Consul agent.

I tried to state the use case above, but I'll reiterate and try to expand on it further:

I need to configure Terraform's Vault provider with the address of a healthy Vault instance. With the current data.consul_service behavior, I end up getting back the addresses of unhealthy, failed, or departed nodes.

From shared.tf in a project where I'm setting some Vault configuration:

# ==========
# data sources
# ==========
data "consul_service" "vault" {
  name = "vault"
}

# ==========
# providers
# ==========
provider "consul" {
  version = "~> 2.2"
}
provider "vault" {
  version = "~> 1.4"
  address = "https://${data.consul_service.vault.service.0.address}:${data.consul_service.vault.service.0.port}"
}

As you can see, I'm leveraging the data source to get the address of the service.

Because it's the catalog interface, it returns all nodes, regardless of health.

This works brilliantly when there are no dead nodes, but breaks the moment there is one, which is quite common if you have client nodes in an autoscaling group or similar.

So ideally I need a data source that exercises the health endpoint. I'm not sure whether it should be a generic consul_health data source that can be filtered down (e.g. service = "vault" or similar), or a consul_service_health data source that is already scoped to one service.

Does that make sense?

I'm doing a lot of automation and orchestration around the setup and administration of a Consul, Vault, and Concourse stack. As mentioned above, you cannot easily mess with container DNS, so this is the ideal method of retrieving service node addresses.

If the data source could actually return healthy addresses, I could effectively use it for service discovery. As it stands, it isn't of much use, because the result set includes unhealthy nodes.

I need it badly enough that I'm willing to learn Go and start a PR if I have to, but if anyone else wants to jump on it, I wish them all the luck and kindness in the world!

@remilapeyre
Collaborator

Hi @eedwards-sk, thanks for taking the time to explain why using the DNS interface was not an option.

A new consul_service_health data source around /health/service/:service is probably the best way forward.
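For reference, that endpoint takes a passing query parameter to return only the healthy instances:

# Only return instances whose checks are all passing
curl 'http://localhost:8500/v1/health/service/vault?passing'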

I can start working on this next week.

@eedwards-sk
Author

Awesome, please let me know if you need any testing. I'm happy to help!

@remilapeyre
Collaborator

remilapeyre commented Jan 29, 2019

Hi @eedwards-sk, I started working on the PR; it is not fully ready yet, but you can see the progress in #89.

You should be able to get the healthy Vault instances with:

data "consul_service_health" "vault" {
  service = "vault"
  passing = true
}

provider "consul" {
  version = "~> 2.2"
}

provider "vault" {
  version = "~> 1.4"
  address = "https://${data.consul_service_health.vault.nodes.0.service_address}:${data.consul_service_health.vault.nodes.0.service_port}"
}

Can you try it and give your feedback?

@eedwards-sk
Author

@remilapeyre Sure, what's the correct way to load a provider that overrides the default (so I can use your PR's version)?

@remilapeyre
Collaborator

You can put the provider in ~/.terraform.d/plugins (or %APPDATA%\terraform.d\plugins on Windows)

https://www.terraform.io/docs/extend/how-terraform-works.html#discovery
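For example, a minimal sketch assuming a Unix shell and that the binary is named terraform-provider-consul:

# Place the custom build where Terraform's plugin discovery looks
mkdir -p ~/.terraform.d/plugins
cp terraform-provider-consul ~/.terraform.d/plugins/
chmod +x ~/.terraform.d/plugins/terraform-provider-consul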

@remilapeyre
Collaborator

If you are not able to build it, I think I should be able to send you a cross-compiled binary

@eedwards-sk
Author

Thanks, I'm running Terraform through my tool concourse-terraform, so I could pull in your PR and stage it into the Terraform working directory, if that would work?

I'd need to confirm the expected layout of the plugin directory once it's checked out. It's not easy to stage plugins in .terraform.d, or anywhere else in the home directory; the working directory is best.

> If you are not able to build it, I think I should be able to send you a cross-compiled binary

That would work :) I'm running this inside Alpine.

@remilapeyre
Collaborator

You can download it here: https://temp-terraform-consul.s3.eu-central-1.amazonaws.com/terraform-provider-consul

The sha sum should be 3b8d07ceeaa14a914255988d1148592a09e9d3d8
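Assuming that 40-character digest is SHA-1, it can be checked after downloading:

# Compare the output against the digest above
sha1sum terraform-provider-consul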

@remilapeyre
Collaborator

I don't think you can build it in your image, but if you copy it to /root/.terraform.d/plugins, I think it should work.

@eedwards-sk
Author

Okay, I was able to try it successfully after:

  • making the file executable
  • renaming it to match the previous plugin executable naming format (version in the filename)
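Concretely, that was something like the following (the _v2.2.0_x4 suffix matches the plugin filename that appears in the logs below):

# Make the downloaded binary executable and give it the versioned plugin name
chmod +x terraform-provider-consul
mv terraform-provider-consul terraform-provider-consul_v2.2.0_x4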

I got a failure so I enabled debug output and captured it:

data.consul_service_health.vault: Refreshing state...
2019/01/30 00:04:30 [ERROR] root: eval: *terraform.EvalReadDataApply, err: data.consul_service_health.vault: Failed to retrieve service health: time: invalid duration
2019/01/30 00:04:30 [ERROR] root: eval: *terraform.EvalSequence, err: data.consul_service_health.vault: Failed to retrieve service health: time: invalid duration

2019/01/30 00:04:30 [DEBUG] plugin: waiting for all plugin processes to complete...
Error: Error refreshing state: 1 error(s) occurred:

* data.consul_service_health.vault: 1 error(s) occurred:

* data.consul_service_health.vault: data.consul_service_health.vault: Failed to retrieve service health: time: invalid duration


2019-01-30T00:04:30.913Z [DEBUG] plugin.terraform-provider-consul_v2.2.0_x4: 2019/01/30 00:04:30 [ERR] plugin: plugin server: accept unix /tmp/plugin363343893: use of closed network connection
2019-01-30T00:04:30.913Z [DEBUG] plugin: plugin process exited: path=/tmp/tfwork/terraform/terraform-provider-consul_v2.2.0_x4
2019-01-30T00:04:30.920Z [DEBUG] plugin.terraform-provider-vault_v1.4.1_x4: 2019/01/30 00:04:30 [ERR] plugin: plugin server: accept unix /tmp/plugin093530756: use of closed network connection
2019-01-30T00:04:30.924Z [DEBUG] plugin: plugin process exited: path=/tmp/tfwork/terraform/.terraform/plugins/linux_amd64/terraform-provider-vault_v1.4.1_x4

@remilapeyre
Collaborator

Thanks, I will do further testing tonight.

@remilapeyre
Collaborator

Hi @eedwards-sk, I ran more comprehensive tests but wasn't able to reproduce the bug.

Could you send me the result of http://consul_hostname:8500/v1/health/service/vault if there is no confidential information in it?
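For example, something like this, with consul_hostname replaced by one of your servers:

# Dump the raw health endpoint response for the vault service
curl http://consul_hostname:8500/v1/health/service/vault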

@eedwards-sk
Author

@remilapeyre absolutely! This is a testing cluster that I'll be tearing down anyway.

vault service health
[
  {
    "Node": {
      "ID": "c76b4435-8d86-231e-2e20-9e856a33a7aa",
      "Node": "i-00095293ed1bc9935",
      "Address": "172.31.47.106",
      "Datacenter": "us-east-1",
      "TaggedAddresses": {
        "lan": "172.31.47.106",
        "wan": "172.31.47.106"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 86984,
      "ModifyIndex": 86985
    },
    "Service": {
      "ID": "vault:172.31.47.106:8200",
      "Service": "vault",
      "Tags": [
        "active"
      ],
      "Address": "172.31.47.106",
      "Meta": null,
      "Port": 8200,
      "Weights": {
        "Passing": 1,
        "Warning": 1
      },
      "EnableTagOverride": false,
      "ProxyDestination": "",
      "Proxy": {},
      "Connect": {},
      "CreateIndex": 86988,
      "ModifyIndex": 87005
    },
    "Checks": [
      {
        "Node": "i-00095293ed1bc9935",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {
          "Interval": "",
          "Timeout": "",
          "DeregisterCriticalServiceAfter": ""
        },
        "CreateIndex": 86984,
        "ModifyIndex": 86984
      },
      {
        "Node": "i-00095293ed1bc9935",
        "CheckID": "vault:172.31.47.106:8200:vault-sealed-check",
        "Name": "Vault Sealed Status",
        "Status": "passing",
        "Notes": "Vault service is healthy when Vault is in an unsealed status and can become an active Vault server",
        "Output": "Vault Unsealed",
        "ServiceID": "vault:172.31.47.106:8200",
        "ServiceName": "vault",
        "ServiceTags": [
          "active"
        ],
        "Definition": {
          "Interval": "",
          "Timeout": "",
          "DeregisterCriticalServiceAfter": ""
        },
        "CreateIndex": 86989,
        "ModifyIndex": 87006
      }
    ]
  },
  {
    "Node": {
      "ID": "bb1bf2ef-5ad6-30b5-edba-a1bf2ca4565a",
      "Node": "i-0e848a28755f837af",
      "Address": "172.31.91.78",
      "Datacenter": "us-east-1",
      "TaggedAddresses": {
        "lan": "172.31.91.78",
        "wan": "172.31.91.78"
      },
      "Meta": {
        "consul-network-segment": ""
      },
      "CreateIndex": 87030,
      "ModifyIndex": 87032
    },
    "Service": {
      "ID": "vault:172.31.91.78:8200",
      "Service": "vault",
      "Tags": [
        "standby"
      ],
      "Address": "172.31.91.78",
      "Meta": null,
      "Port": 8200,
      "Weights": {
        "Passing": 1,
        "Warning": 1
      },
      "EnableTagOverride": false,
      "ProxyDestination": "",
      "Proxy": {},
      "Connect": {},
      "CreateIndex": 87036,
      "ModifyIndex": 87036
    },
    "Checks": [
      {
        "Node": "i-0e848a28755f837af",
        "CheckID": "serfHealth",
        "Name": "Serf Health Status",
        "Status": "passing",
        "Notes": "",
        "Output": "Agent alive and reachable",
        "ServiceID": "",
        "ServiceName": "",
        "ServiceTags": [],
        "Definition": {
          "Interval": "",
          "Timeout": "",
          "DeregisterCriticalServiceAfter": ""
        },
        "CreateIndex": 87030,
        "ModifyIndex": 87030
      },
      {
        "Node": "i-0e848a28755f837af",
        "CheckID": "vault:172.31.91.78:8200:vault-sealed-check",
        "Name": "Vault Sealed Status",
        "Status": "passing",
        "Notes": "Vault service is healthy when Vault is in an unsealed status and can become an active Vault server",
        "Output": "Vault Unsealed",
        "ServiceID": "vault:172.31.91.78:8200",
        "ServiceName": "vault",
        "ServiceTags": [
          "standby"
        ],
        "Definition": {
          "Interval": "",
          "Timeout": "",
          "DeregisterCriticalServiceAfter": ""
        },
        "CreateIndex": 87037,
        "ModifyIndex": 87087
      }
    ]
  }
]

@remilapeyre
Collaborator

I'm a bit puzzled because in

"Definition": {
  "Interval": "",
  "Timeout": "",
  "DeregisterCriticalServiceAfter": ""
},

each field should contain a duration like "5s". Parsing an empty string as a Go duration fails with exactly the time: invalid duration error you saw, so these empty fields are presumably what trips the provider up.

When testing, I run a Consul development server with consul agent -dev and a Vault server with vault server -dev -config vault.hcl, where vault.hcl is:

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}

and when fetching http://localhost:8500/v1/health/service/vault I get:

[
    {
        "Node": {
            "ID": "ebe57535-3fcc-4431-24f4-08620388bf0d",
            "Node": "MBP-de-Remi",
            "Address": "127.0.0.1",
            "Datacenter": "dc1",
            "TaggedAddresses": {
                "lan": "127.0.0.1",
                "wan": "127.0.0.1"
            },
            "Meta": {
                "consul-network-segment": ""
            },
            "CreateIndex": 9,
            "ModifyIndex": 10
        },
        "Service": {
            "ID": "vault:127.0.0.1:8200",
            "Service": "vault",
            "Tags": [
                "active"
            ],
            "Address": "127.0.0.1",
            "Meta": null,
            "Port": 8200,
            "Weights": {
                "Passing": 1,
                "Warning": 1
            },
            "EnableTagOverride": false,
            "ProxyDestination": "",
            "Proxy": {},
            "Connect": {},
            "CreateIndex": 44,
            "ModifyIndex": 48
        },
        "Checks": [
            {
                "Node": "MBP-de-Remi",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": "",
                "ServiceTags": [],
                "Definition": {},
                "CreateIndex": 9,
                "ModifyIndex": 9
            },
            {
                "Node": "MBP-de-Remi",
                "CheckID": "vault:127.0.0.1:8200:vault-sealed-check",
                "Name": "Vault Sealed Status",
                "Status": "passing",
                "Notes": "Vault service is healthy when Vault is in an unsealed status and can become an active Vault server",
                "Output": "",
                "ServiceID": "vault:127.0.0.1:8200",
                "ServiceName": "vault",
                "ServiceTags": [
                    "active"
                ],
                "Definition": {},
                "CreateIndex": 46,
                "ModifyIndex": 49
            }
        ]
    }
]

where "Definition" is the empty object {}.

Can you tell me your versions of Vault and Consul?

@eedwards-sk
Author

eedwards-sk commented Jan 30, 2019

vault 1.0.1
consul 1.4.1

Vault automatically registers its health checks with Consul:

https://github.com/hashicorp/vault/blob/v1.0.1/physical/consul/consul.go
https://github.com/hashicorp/vault/blob/v1.0.1/vendor/github.com/hashicorp/consul/api/api.go
https://github.com/hashicorp/vault/blob/v1.0.1/vendor/github.com/hashicorp/consul/api/agent.go

Edit: specifically, https://github.com/hashicorp/vault/blob/v1.0.1/physical/consul/consul.go#L827-L844

Edit2: It doesn't seem to have anything to do with Vault. Consul's own serfHealth check also has a Definition block like that (keys with empty values).

Maybe it's something to do with -dev mode? These are production-grade setups.

@remilapeyre
Collaborator

OK, I'm using Consul 1.4.0, and the difference in behavior comes from that.

There have been many changes to this endpoint between 1.4.0 and 1.4.1; running git log -p v1.4.0...v1.4.1 and looking for DeregisterCriticalServiceAfter shows most of them.
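For instance, a pickaxe search (a sketch, run from a Consul checkout) narrows it down:

# Show commits between the two tags that touched DeregisterCriticalServiceAfter
git log -p -S DeregisterCriticalServiceAfter v1.4.0...v1.4.1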

I will try to update the vendored Consul dependency, and I expect the issue to disappear.

@remilapeyre
Collaborator

Ok, I updated the plugin at https://temp-terraform-consul.s3.eu-central-1.amazonaws.com/terraform-provider-consul

The new shasum is f4c3372e89fce52dc65b3046fe76c2b07e16c421, and it should work against Consul 1.4.1.

@eedwards-sk
Author

A terraform plan with that version was successful.

@remilapeyre
Collaborator

Awesome 🎉. I need a better test grid with multiple versions of Consul so this does not happen again in the future.

Thanks for your help!

@eedwards-sk
Author

Thanks for the work!

@eedwards-sk
Author

Heya @remilapeyre, any progress on this?

@remilapeyre
Collaborator

Hi @eedwards-sk, I'm having some issues implementing the retry suggested in #89 (comment).

If I can't get it to work, I think I will make a new release in the next few days and merge the feature as it is now, without the retry option.

@eedwards-sk
Author

Sounds good. The retry is outside the scope of my use case anyway, but it would make a good follow-up enhancement.
