-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
X-Consul-Index behaves strangely on non-existing services with an existing service prefix #3712
Comments
Hi @pierresouchay thanks for opening an issue - I was wondering when someone would notice this :-) This is actually working as intended, though it is a little weird. Internally, Consul stores all of its data like the catalog using https://github.com/hashicorp/go-memdb which uses radix trees for prefix-based indexing. There's a watch mechanism whereby queries return a "watch set" with a set of channels to be notified if something changes (here's an example). When you query for a non-existent entry, you end up with a high-level node in the radix tree to watch on, up to possibly the root node if there's not much data in the index. Because blocking queries can always wake up without returning new data we decided not to optimize this further and add more complexity to the query code, though we did think about it. The results for a non-existent query are cheap to serve, and once the index gets populated with some real data you usually end up watching a more specific prefix node. Your case is kind of pathological because you have some other service that's a prefix of the one you are watching, so indeed you'd get some churn in there. If you are really seeing a few QPS in practice it's probably not a huge deal, though it might be worth addressing if it's causing huge amounts of bogus requests. |
Hi @slackpad, unlike @pierresouchay, we rely on this We have about 600+ different services and the number is still growing, the services from same department have same prefix in service names, so our LoadBalancer use prefix to block query those services instead of 1 block query per services. Can we rely on this For example: services: we block query com.a to get a0, a1, a2 service update, block query com.b to get b0, b1, b2 service update. |
Hi @weiwei04 that's an interesting use case. Because of the internal implementation this should work because everything is managed with a prefix tree, and we don't have plans to change that, but it's not an explicit part of the interface contract. It might be better as an enhancement request to make a more explicit new API to watch for changes for service by prefix. |
- Do not require a call to the catalog if services are specified by name. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: hashicorp/consul#3712 and because a catalog on a large cluster would basically change *all* the time. No need to discover targets in 1sec if we scrape them every minute. TODO get number: CPU/RAM/QPS/Latency for - filter by relabel (before) vs. filter by tag (after) - filter by service before after
Related to prometheus#3711 - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and nore-meta are also used in `/catalog/service` requests. - Do not require a call to the catalog if services are specified by name. This is important because on large cluster `/catalog/services` changes all the time. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: hashicorp/consul#3712 and because a catalog on a large cluster would basically change *all* the time. No need to discover targets in 1sec if we scrape them every minute. - Added plenty of unit tests. Benchmarks ---------- ```yaml scrape_configs: - job_name: prometheus scrape_interval: 60s static_configs: - targets: ["127.0.0.1:9090"] - job_name: "observability-by-tag" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 tag: marathon-user-observability # Used in After refresh_interval: 30s # Used in After+delay relabel_configs: - source_labels: [__meta_consul_tags] regex: ^(.*,)?marathon-user-observability(,.*)?$ action: keep - job_name: "observability-by-name" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - observability-cerebro - observability-portal-web - job_name: "fake-fake-fake" scrape_interval: "15s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - fake-fake-fake ``` Note: tested with ~1200 services, ~5000 nodes. | Resource | Empty | Before | After | After + delay | | -------- |:-----:|:------:|:-----:|:-------------:| |/service-discovery size|5K|85MiB|27k|27k|27k| |`go_memstats_heap_objects`|100k|1M|120k|110k| |`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB| |`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s| |`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%| |`process_open_fds`|16|*1236*|22|22| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5| |`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms| |Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps| Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul. Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
Related to prometheus#3711 - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and nore-meta are also used in `/catalog/service` requests. - Do not require a call to the catalog if services are specified by name. This is important because on large cluster `/catalog/services` changes all the time. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: hashicorp/consul#3712 and because a catalog on a large cluster would basically change *all* the time. No need to discover targets in 1sec if we scrape them every minute. - Added plenty of unit tests. Benchmarks ---------- ```yaml scrape_configs: - job_name: prometheus scrape_interval: 60s static_configs: - targets: ["127.0.0.1:9090"] - job_name: "observability-by-tag" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 tag: marathon-user-observability # Used in After refresh_interval: 30s # Used in After+delay relabel_configs: - source_labels: [__meta_consul_tags] regex: ^(.*,)?marathon-user-observability(,.*)?$ action: keep - job_name: "observability-by-name" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - observability-cerebro - observability-portal-web - job_name: "fake-fake-fake" scrape_interval: "15s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - fake-fake-fake ``` Note: tested with ~1200 services, ~5000 nodes. | Resource | Empty | Before | After | After + delay | | -------- |:-----:|:------:|:-----:|:-------------:| |/service-discovery size|5K|85MiB|27k|27k|27k| |`go_memstats_heap_objects`|100k|1M|120k|110k| |`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB| |`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s| |`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%| |`process_open_fds`|16|*1236*|22|22| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5| |`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms| |Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps| Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul. Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery.
* consul: improve consul service discovery Related to #3711 - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and nore-meta are also used in `/catalog/service` requests. - Do not require a call to the catalog if services are specified by name. This is important because on large cluster `/catalog/services` changes all the time. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: hashicorp/consul#3712 and because a catalog on a large cluster would basically change *all* the time. No need to discover targets in 1sec if we scrape them every minute. - Added plenty of unit tests. Benchmarks ---------- ```yaml scrape_configs: - job_name: prometheus scrape_interval: 60s static_configs: - targets: ["127.0.0.1:9090"] - job_name: "observability-by-tag" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 tag: marathon-user-observability # Used in After refresh_interval: 30s # Used in After+delay relabel_configs: - source_labels: [__meta_consul_tags] regex: ^(.*,)?marathon-user-observability(,.*)?$ action: keep - job_name: "observability-by-name" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - observability-cerebro - observability-portal-web - job_name: "fake-fake-fake" scrape_interval: "15s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - fake-fake-fake ``` Note: tested with ~1200 services, ~5000 nodes. | Resource | Empty | Before | After | After + delay | | -------- |:-----:|:------:|:-----:|:-------------:| |/service-discovery size|5K|85MiB|27k|27k|27k| |`go_memstats_heap_objects`|100k|1M|120k|110k| |`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB| |`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s| |`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%| |`process_open_fds`|16|*1236*|22|22| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5| |`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms| |Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps| Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul. Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery. * consul: tweak `refresh_interval` behavior `refresh_interval` now does what is advertised in the documentation, there won't be more that one update per `refresh_interval`. It now defaults to 30s (which was also the current waitTime in the consul query). This also make sure we don't wait another 30s if we already waited 29s in the blocking call by substracting the number of elapsed seconds. Hopefully this will do what people expect it does and will be safer for existing consul infrastructures.
* consul: improve consul service discovery Related to prometheus#3711 - Add the ability to filter by tag and node-meta in an efficient way (`/catalog/services` allow filtering by node-meta, and returns a `map[string]string` or `service`->`tags`). Tags and nore-meta are also used in `/catalog/service` requests. - Do not require a call to the catalog if services are specified by name. This is important because on large cluster `/catalog/services` changes all the time. - Add `allow_stale` configuration option to do stale reads. Non-stale reads can be costly, even more when you are doing them to a remote datacenter with 10k+ targets over WAN (which is common for federation). - Add `refresh_interval` to minimize the strain on the catalog and on the service endpoint. This is needed because of that kind of behavior from consul: hashicorp/consul#3712 and because a catalog on a large cluster would basically change *all* the time. No need to discover targets in 1sec if we scrape them every minute. - Added plenty of unit tests. Benchmarks ---------- ```yaml scrape_configs: - job_name: prometheus scrape_interval: 60s static_configs: - targets: ["127.0.0.1:9090"] - job_name: "observability-by-tag" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 tag: marathon-user-observability # Used in After refresh_interval: 30s # Used in After+delay relabel_configs: - source_labels: [__meta_consul_tags] regex: ^(.*,)?marathon-user-observability(,.*)?$ action: keep - job_name: "observability-by-name" scrape_interval: "60s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - observability-cerebro - observability-portal-web - job_name: "fake-fake-fake" scrape_interval: "15s" metrics_path: "/metrics" consul_sd_configs: - server: consul.service.par.consul.prod.crto.in:8500 services: - fake-fake-fake ``` Note: tested with ~1200 services, ~5000 nodes. | Resource | Empty | Before | After | After + delay | | -------- |:-----:|:------:|:-----:|:-------------:| |/service-discovery size|5K|85MiB|27k|27k|27k| |`go_memstats_heap_objects`|100k|1M|120k|110k| |`go_memstats_heap_alloc_bytes`|24MB|150MB|28MB|27MB| |`rate(go_memstats_alloc_bytes_total[5m])`|0.2MB/s|28MB/s|2MB/s|0.3MB/s| |`rate(process_cpu_seconds_total[5m])`|0.1%|15%|2%|0.01%| |`process_open_fds`|16|*1236*|22|22| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="services"}[5m])`|~0|1|1|*0.03*| |`rate(prometheus_sd_consul_rpc_duration_seconds_count{call="service"}[5m])`|0.1|*80*|0.5|0.5| |`prometheus_target_sync_length_seconds{quantile="0.9",scrape_job="observability-by-tag"}`|N/A|200ms|0.2ms|0.2ms| |Network bandwidth|~10kbps|~2.8Mbps|~1.6Mbps|~10kbps| Filtering by tag using relabel_configs uses **100kiB and 23kiB/s per service per job** and quite a lot of CPU. Also sends and additional *1Mbps* of traffic to consul. Being a little bit smarter about this reduces the overhead quite a lot. Limiting the number of `/catalog/services` queries per second almost removes the overhead of service discovery. * consul: tweak `refresh_interval` behavior `refresh_interval` now does what is advertised in the documentation, there won't be more that one update per `refresh_interval`. It now defaults to 30s (which was also the current waitTime in the consul query). This also make sure we don't wait another 30s if we already waited 29s in the blocking call by substracting the number of elapsed seconds. Hopefully this will do what people expect it does and will be safer for existing consul infrastructures. Signed-off-by: sipian <cs15btech11019@iith.ac.in>
Fixed in Prometheus and Consul |
consul version
for both Client and ServerClient: 0.9.3
Server: 0.9.3
Description of the Issue (and unexpected/desired result)
We have several large clusters (~5k hosts per clusters) connected thru WAN and have some scalability issues. During one of my investigations, I noticed a strange bug:
Reproduction steps
Create this script called consul_check_services_changes.sh
Call it on a non-existing service:
Blocks -> OK
Call it on a real service:
Blocks as long as the service is not updated -> OK
Call it on a service that does not exists but has an existing service as prefix
Basically, we are notified on each change of X-Consul-Index for such services. Thus I suspect a significant but in the implementation. (It should behave the same way as either and existing service or a non-existing service right ?
The text was updated successfully, but these errors were encountered: