X-Consul-Index behaves strangely on non-existing services with an existing service prefix #3712

Closed
pierresouchay opened this issue Nov 25, 2017 · 4 comments
Labels: theme/api (Relating to the HTTP API interface), thinking (More time is needed to research by the Consul Contributors), type/enhancement (Proposed improvement or new feature)
Milestone: Unplanned

Comments

@pierresouchay (Contributor)

consul version for both Client and Server

Client: 0.9.3
Server: 0.9.3

Description of the Issue (and unexpected/desired result)

We have several large clusters (~5k hosts per cluster) connected through WAN links and are running into some scalability issues. During one of my investigations, I noticed a strange bug:

  1. When requesting a non-existing service with an X-Consul-Index, Consul blocks until the timeout occurs or the service gets registered => OK
  2. When requesting an existing service with an X-Consul-Index fetched for this service, Consul blocks until the timeout occurs or the service gets updated => OK
  3. But when requesting a non-existing service with an X-Consul-Index, if another existing service's name is a prefix of the requested name, Consul does not block and every update of X-Consul-Index is caught.

Reproduction steps

Create this script called consul_check_services_changes.sh

#!/bin/bash
if test ${1:-none} = "none"
then
  echo "USAGE: $0 Consul_URL"
  echo "       Example: localhost:8500/v1/health/service/MY_SUPER_SERVICE"
  exit 1
fi
url_to_check=$1

headers=$(mktemp)
content=$(mktemp)
index=0
while true;
do
  # Blocking query: wait up to 5 minutes for a change past the last known index
  url="${url_to_check}?wait=5m&index=${index}&pretty=true&stale"
  curl -fs --dump-header "$headers" -o "${content}.new" "${url}" || { echo "Failed to query ${url}"; exit 1; }
  if test "$index" -ne 0
  then
    diff -u "$content" "$content.new" && echo " diff: No Differences found in service"
  fi
  # Remember the X-Consul-Index returned in the response headers for the next iteration
  index=$(grep "X-Consul-Index" "$headers" | sed 's/[^0-9]*\([0-9][0-9]*\)[^0-9]*/\1/g')
  mv "$content.new" "$content"
  printf "X-Consul-Index: $index at $(date) \b"
done

Call it on a non-existing service:

./consul_check_services_changes.sh consul-agent:8500/v1/catalog/service/non-existing-service
X-Consul-Index: 37968409 at Sat 25 Nov 2017 01:27:09 CET

Blocks -> OK

Call it on a real service:

consul_check_services_changes.sh consul-agent:8500/v1/catalog/service/consul
X-Consul-Index: 37968524 at Sat 25 Nov 2017 01:27:31 CET

Blocks until the service is updated -> OK

Call it on a service that does not exist but has an existing service name as a prefix:

consul_check_services_changes.sh consul-agent:8500/v1/catalog/service/consul2
X-Consul-Index: 37968837 at Sat 25 Nov 2017 01:28:06 CET diff: No Differences found in service
X-Consul-Index: 37968864 at Sat 25 Nov 2017 01:28:07 CET diff: No Differences found in service
X-Consul-Index: 37968865 at Sat 25 Nov 2017 01:28:07 CET diff: No Differences found in service
X-Consul-Index: 37968868 at Sat 25 Nov 2017 01:28:07 CET diff: No Differences found in service
X-Consul-Index: 37968869 at Sat 25 Nov 2017 01:28:07 CET diff: No Differences found in service
X-Consul-Index: 37968877 at Sat 25 Nov 2017 01:28:08 CET diff: No Differences found in service
X-Consul-Index: 37968878 at Sat 25 Nov 2017 01:28:08 CET diff: No Differences found in service
X-Consul-Index: 37968879 at Sat 25 Nov 2017 01:28:08 CET diff: No Differences found in service
X-Consul-Index: 37968880 at Sat 25 Nov 2017 01:28:09 CET diff: No Differences found in service
X-Consul-Index: 37968887 at Sat 25 Nov 2017 01:28:09 CET diff: No Differences found in service
X-Consul-Index: 37968888 at Sat 25 Nov 2017 01:28:09 CET diff: No Differences found in service
X-Consul-Index: 37968896 at Sat 25 Nov 2017 01:28:12 CET diff: No Differences found in service
X-Consul-Index: 37968898 at Sat 25 Nov 2017 01:28:12 CET diff: No Differences found in service
X-Consul-Index: 37968914 at Sat 25 Nov 2017 01:28:12 CET diff: No Differences found in service
X-Consul-Index: 37968915 at Sat 25 Nov 2017 01:28:12 CET diff: No Differences found in service
X-Consul-Index: 37968919 at Sat 25 Nov 2017 01:28:13 CET diff: No Differences found in service
X-Consul-Index: 37968920 at Sat 25 Nov 2017 01:28:13 CET diff: No Differences found in service
X-Consul-Index: 37968921 at Sat 25 Nov 2017 01:28:14 CET diff: No Differences found in service
[...]

Basically, we are notified on every change of X-Consul-Index for such services. I therefore suspect a significant bug in the implementation. (It should behave the same way as either an existing service or a non-existing service, right?)

@slackpad (Contributor)

Hi @pierresouchay thanks for opening an issue - I was wondering when someone would notice this :-)

This is actually working as intended, though it is a little weird. Internally, Consul stores all of its data, like the catalog, using https://github.com/hashicorp/go-memdb, which uses radix trees for prefix-based indexing. There's a watch mechanism whereby queries return a "watch set" with a set of channels to be notified if something changes (here's an example). When you query for a non-existent entry, you end up with a high-level node in the radix tree to watch on, possibly up to the root node if there's not much data in the index. Because blocking queries can always wake up without returning new data, we decided not to optimize this further and add more complexity to the query code, though we did think about it. The results for a non-existent query are cheap to serve, and once the index gets populated with some real data you usually end up watching a more specific prefix node.

Your case is kind of pathological because you have some other service that's a prefix of the one you are watching, so indeed you'd get some churn in there. If you are really seeing a few QPS in practice it's probably not a huge deal, though it might be worth addressing if it's causing huge amounts of bogus requests.
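
To make the mechanism above concrete, here is a minimal, self-contained Go sketch of the go-memdb watch behavior (the schema, table name, and service names are illustrative assumptions, not Consul's actual state-store code): a lookup for the missing `consul2` still returns a watch channel, and that channel belongs to a radix node shared with the existing `consul` service, so any update to `consul` wakes the `consul2` watcher.

```go
package main

import (
	"fmt"
	"time"

	memdb "github.com/hashicorp/go-memdb"
)

// Service is a stand-in for a catalog entry; only the name matters here.
type Service struct {
	Name string
}

func main() {
	// Hypothetical schema: one table, indexed by service name.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"services": {
				Name: "services",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "Name"},
					},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Register the existing service "consul"; "consul2" is never registered.
	txn := db.Txn(true)
	if err := txn.Insert("services", &Service{Name: "consul"}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Look up the non-existent "consul2": obj is nil, but we still get a
	// watch channel, taken from the closest radix node on the lookup path.
	read := db.Txn(false)
	watchCh, obj, _ := read.FirstWatch("services", "id", "consul2")
	fmt.Println("consul2 found:", obj != nil) // false

	ws := memdb.NewWatchSet()
	ws.Add(watchCh)

	// Updating "consul" mutates that shared radix node...
	write := db.Txn(true)
	write.Insert("services", &Service{Name: "consul"})
	write.Commit()

	// ...so the wait on "consul2" returns instead of timing out.
	timedOut := ws.Watch(time.After(200 * time.Millisecond))
	fmt.Println("timed out:", timedOut) // false: woken up by the "consul" update
}
```

This is why the blocking query on `consul2` wakes up on every write that touches the shared prefix, even though its own (empty) result never changes.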

@slackpad added the type/enhancement (Proposed improvement or new feature), theme/api (Relating to the HTTP API interface), and thinking (More time is needed to research by the Consul Contributors) labels on Dec 19, 2017
@slackpad added this to the Unplanned milestone on Dec 19, 2017
@weiwei04 (Contributor)

Hi @slackpad, unlike @pierresouchay, we rely on this intended behavior.

We have about 600+ different services and the number is still growing. Services from the same department share a common prefix in their names, so our LoadBalancer uses one blocking query per prefix to watch those services instead of one blocking query per service. Can we rely on this intended behavior in future Consul releases?

For example:

services:
com.a.a0, com.a.a1, com.a.a2
com.b.b0, com.b.b1, com.b.b2

We block-query com.a to get service updates for a0, a1, a2, and block-query com.b to get service updates for b0, b1, b2.
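
For illustration, here is a hedged Go sketch of that prefix-watch pattern (the agent address, prefix name, and helper structure are assumptions; the `index`/`wait` handling follows Consul's documented blocking-query protocol, while the prefix trigger itself is the unofficial behavior discussed in this issue):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// watchPrefix blocks on a service name that is only a prefix ("com.a") and
// re-lists the concrete services whenever the shared radix prefix changes.
func watchPrefix(agent, prefix string) {
	index := "0"
	// The client timeout must exceed the server-side wait (5m plus jitter).
	client := &http.Client{Timeout: 6 * time.Minute}
	for {
		url := fmt.Sprintf("http://%s/v1/catalog/service/%s?index=%s&wait=5m&stale",
			agent, prefix, index)
		resp, err := client.Get(url)
		if err != nil {
			time.Sleep(5 * time.Second) // back off on errors
			continue
		}
		resp.Body.Close()

		newIndex := resp.Header.Get("X-Consul-Index")
		if newIndex == "" || newIndex == index {
			continue // spurious wake-up or timeout: nothing changed
		}
		index = newIndex // production code should reset to 0 if this ever decreases

		// Something under the prefix (com.a.a0, com.a.a1, ...) may have changed:
		// refresh the concrete services here, e.g. via /v1/catalog/services.
		fmt.Println("prefix", prefix, "changed, X-Consul-Index =", index)
	}
}

func main() {
	watchPrefix("localhost:8500", "com.a")
}
```

Because wake-ups can be spurious, the loop only acts when X-Consul-Index actually changes between iterations.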

@slackpad (Contributor)

Hi @weiwei04, that's an interesting use case. This should work because of the internal implementation (everything is managed with a prefix tree, and we don't have plans to change that), but it's not an explicit part of the interface contract. It might be better to file an enhancement request for a more explicit new API to watch for service changes by prefix.

iksaif pushed a commit to iksaif/prometheus that referenced this issue Feb 2, 2018
- Do not require a call to the catalog if services are specified by name.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more so when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of this kind of behavior from
  Consul: hashicorp/consul#3712 and because the catalog
  on a large cluster basically changes *all* the time. No need to discover
  targets every second if we scrape them every minute.

TODO: get CPU/RAM/QPS/latency numbers for
- filter by relabel (before) vs. filter by tag (after)
- filter by service (before vs. after)
iksaif pushed a commit to criteo-forks/prometheus that referenced this issue Mar 16, 2018
iksaif pushed a commit to iksaif/prometheus that referenced this issue Mar 21, 2018
brian-brazil pushed a commit to prometheus/prometheus that referenced this issue Mar 23, 2018
* consul: improve consul service discovery

Related to #3711

- Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services`
  allows filtering by node-meta, and returns a `map[string]string` of `service`->`tags`).
  Tags and node-meta are also used in `/catalog/service` requests.
- Do not require a call to the catalog if services are specified by name. This is important
  because on a large cluster `/catalog/services` changes all the time.
- Add `allow_stale` configuration option to do stale reads. Non-stale
  reads can be costly, even more so when you are doing them to a remote
  datacenter with 10k+ targets over WAN (which is common for federation).
- Add `refresh_interval` to minimize the strain on the catalog and on the
  service endpoint. This is needed because of this kind of behavior from
  Consul: hashicorp/consul#3712 and because the catalog
  on a large cluster basically changes *all* the time. No need to discover
  targets every second if we scrape them every minute.
- Added plenty of unit tests.

Benchmarks
----------

```yaml
scrape_configs:

- job_name: prometheus
  scrape_interval: 60s
  static_configs:
    - targets: ["127.0.0.1:9090"]

- job_name: "observability-by-tag"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      tag: marathon-user-observability  # Used in After
      refresh_interval: 30s             # Used in After+delay
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: ^(.*,)?marathon-user-observability(,.*)?$
      action: keep

- job_name: "observability-by-name"
  scrape_interval: "60s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - observability-cerebro
        - observability-portal-web

- job_name: "fake-fake-fake"
  scrape_interval: "15s"
  metrics_path: "/metrics"
  consul_sd_configs:
    - server: consul.service.par.consul.prod.crto.in:8500
      services:
        - fake-fake-fake
```

Note: tested with ~1200 services, ~5000 nodes.

| Resource | Empty | Before | After | After + delay |
| -------- |:-----:|:------:|:-----:|:-------------:|
|/service-discovery size|5K|85MiB|27k|27k|
|`go_memstats_heap_objects`|100k|1M|120k|110k|
|`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB|
|`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s|
|`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%|
|`process_open_fds`|16|*1236*|22|22|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*|
|`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5|
|`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms|
|Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps|

Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. It also sends an additional *1Mbps* of traffic to Consul.
Being a little bit smarter about this reduces the overhead quite a lot.
Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.

* consul: tweak `refresh_interval` behavior

`refresh_interval` now does what is advertised in the documentation:
there won't be more than one update per `refresh_interval`. It now
defaults to 30s (which was also the current waitTime in the Consul query).

This also makes sure we don't wait another 30s if we already waited 29s
in the blocking call, by subtracting the elapsed time.

Hopefully this will do what people expect and will be safer
for existing Consul infrastructures.
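
For illustration, a minimal Go sketch of the elapsed-time subtraction described in this commit message (a hypothetical helper, not the actual Prometheus code):

```go
package main

import (
	"fmt"
	"time"
)

// waitBeforeNextQuery returns how long to sleep after a blocking query that
// started at `start`, so that at most one update happens per refreshInterval.
func waitBeforeNextQuery(start time.Time, refreshInterval time.Duration) time.Duration {
	elapsed := time.Since(start)
	if elapsed >= refreshInterval {
		return 0 // the blocking call already waited long enough
	}
	return refreshInterval - elapsed
}

func main() {
	start := time.Now().Add(-29 * time.Second) // pretend the blocking call took 29s
	remaining := waitBeforeNextQuery(start, 30*time.Second)
	fmt.Println("sleeping for", remaining) // ~1s instead of a full 30s
}
```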
sipian pushed a commit to sipian/prometheus that referenced this issue May 18, 2018
@pierresouchay (Contributor, Author)

Fixed in Prometheus and Consul
