
GC drops all the series from the head-block every 2 hours #4115

Closed
semyonslepov opened this Issue Apr 26, 2018 · 25 comments

semyonslepov commented Apr 26, 2018

Bug Report

What did you do?

Normal Prometheus operation with ~1000000 series in the head block

What did you expect to see?

Smooth operation with all the data available ~99.99% of the time.
Some percentage of time series being dropped from the head block on every GC execution, but not all of them.

What did you see instead? Under which circumstances?

It seems that every 2 hours all the series are dropped from the head block and then restored in ~10 minutes.
This event coincides with the GC execution.
Time-series data occasionally becomes unavailable during this "drop-restore" period.
Relevant metrics:

prometheus_tsdb_head_series_created_total
prometheus_tsdb_head_series_removed_total
prometheus_tsdb_head_series

And it happens every 2 hours:

(screenshot: series_removed_12h)
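For reference, a minimal sketch of one way to pull the churn in these counters out of the HTTP API (the server address is a placeholder and the query is only an example; a dashboard panel with the same expression works equally well):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder address; point this at the affected Prometheus server.
	q := url.Values{}
	q.Set("query", "increase(prometheus_tsdb_head_series_removed_total[10m])")

	resp, err := http.Get("http://localhost:9090/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// Raw JSON result of the instant query; the 2-hourly spikes line up with
	// the head GC / compaction entries in the logs below.
	fmt.Println(string(body))
}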

Environment

  • System information:

Linux 4.9.91-40.57.amzn1.x86_64 x86_64

  • Prometheus version:
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    account: aws_account_name
    monitor: prometheus-stack
    region: eu-west-1
scrape_configs:
- job_name: node_exporter
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-116:9100
  relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ^(.*)$
    target_label: tier
    replacement: p4
    action: replace
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p4
    action: replace
- job_name: prometheus-server
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-116:9090
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p4
    action: replace
- job_name: federate_lts
  honor_labels: true
  params:
    match[]:
    - '{job=~"[a-z].*",job!="federate_lts"}'
  scrape_interval: 1h
  scrape_timeout: 2m
  metrics_path: /federate
  scheme: http
  static_configs:
  - targets:
    - internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090
  • Logs:
level=info ts=2018-04-26T03:00:00.789789541Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1524700800000 maxt=1524708000000
level=info ts=2018-04-26T03:00:20.711380725Z caller=head.go:348 component=tsdb msg="head GC completed" duration=2.95190116s
level=info ts=2018-04-26T03:00:22.715362637Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=2.003904086s
level=info ts=2018-04-26T03:00:23.670799123Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1524679200000 maxt=1524700800000
level=warn ts=2018-04-26T03:07:03.171395946Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=11585
level=warn ts=2018-04-26T04:06:50.777899371Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=26652
level=info ts=2018-04-26T05:00:00.789822667Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1524708000000 maxt=1524715200000
level=info ts=2018-04-26T05:00:20.636595412Z caller=head.go:348 component=tsdb msg="head GC completed" duration=2.961173699s
level=info ts=2018-04-26T05:00:22.712209338Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=2.07554931s
level=warn ts=2018-04-26T05:07:02.716418894Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=14158
level=warn ts=2018-04-26T06:06:50.703605369Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=21808
level=info ts=2018-04-26T07:00:00.789336883Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1524715200000 maxt=1524722400000
level=info ts=2018-04-26T07:00:17.381170201Z caller=head.go:348 component=tsdb msg="head GC completed" duration=2.120526312s
level=info ts=2018-04-26T07:00:19.36616827Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=1.984937139s
level=warn ts=2018-04-26T07:07:02.628056348Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=10883
level=warn ts=2018-04-26T08:06:50.339291254Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=40776
level=info ts=2018-04-26T09:00:00.79663534Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1524722400000 maxt=1524729600000
level=info ts=2018-04-26T09:00:17.869411821Z caller=head.go:348 component=tsdb msg="head GC completed" duration=2.191874292s
level=info ts=2018-04-26T09:00:19.864876385Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=1.995408902s
level=info ts=2018-04-26T09:00:20.843168795Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1524700800000 maxt=1524722400000
level=warn ts=2018-04-26T09:07:02.5320854Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=45838

krasi-georgiev commented Apr 26, 2018

This looks normal to me.
The head is a temporary in-memory object, and the series are persisted to disk (I think every 2h by default).

The in-memory head is used to avoid frequent writes to disk.


krasi-georgiev commented Apr 26, 2018

Regarding the missing metrics, you might be hitting prometheus/tsdb#260, which already has an open PR, but I doubt it.


brian-brazil commented Apr 26, 2018

scrape_interval: 1h

This does not look like a sane setup, and the logs confirm you're doing something odd.


semyonslepov commented Apr 26, 2018

@brian-brazil it's a higher-level Prometheus scraping another one at a lower frequency.
What's odd about such an approach and the chosen scrape_interval?


krasi-georgiev commented Apr 26, 2018

Are you actually missing any metrics when you run a query?

The head gets persisted to disk when it is cleared, so the metrics aren't lost.


brian-brazil commented Apr 26, 2018

The maximum sane interval is 2m, and the logs indicate metrics are clashing. I'd presume that this behaviour is due to your setup, and not an issue with Prometheus.


krasi-georgiev commented Apr 26, 2018

It took me a while to realise the real issue 😄

@semyonslepov I just had a quick look at the scraping code, and it seems that if the target of the federate_lts job exposes metrics with a timestamp that doesn't overlap with the current head time range, the samples won't get ingested. So it's worth checking whether the target exposes samples with incorrect timestamps.
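If it helps, here is a minimal sketch of one way to eyeball those timestamps (the target address and match[] selector below are placeholders, not the ones from this config); each sample line in the federation output carries its original millisecond timestamp:

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Placeholder target and selector; substitute the real federation endpoint.
	q := url.Values{}
	q.Set("match[]", `{job=~"[a-z].*"}`)

	resp, err := http.Get("http://federated-target:9090/federate?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The federation endpoint returns the text exposition format; each sample
	// line ends with a millisecond timestamp, which is what we want to inspect.
	sc := bufio.NewScanner(resp.Body)
	printed := 0
	for sc.Scan() && printed < 20 {
		line := sc.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip HELP/TYPE comment lines
		}
		fmt.Println(line)
		printed++
	}
}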


semyonslepov commented May 7, 2018

Got a real data-loss issue with a more sane configuration, and it again correlates with the GC execution.

Configuration:

global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s
  external_labels:
    account: spt-observability-pro
    monitor: prometheus-stack
    region: eu-west-1
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - ip-10-58-130-53:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus/alert_configs/*.yml
- /etc/prometheus/recording_rules/*.yml
scrape_configs:
- job_name: alertmanager
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-53:9093
- job_name: node_exporter
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-53:9100
  relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: ^(.*)$
    target_label: tier
    replacement: p1
    action: replace
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p1
    action: replace
- job_name: health-check-target
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /_/metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-53:8181
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p1
    action: replace
- job_name: grafana
  honor_labels: true
  scrape_interval: 30s
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - region: eu-west-1
    refresh_interval: 1m
    port: 3000
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Name]
    separator: ;
    regex: metrics-platform-frontend
    replacement: $1
    action: keep
- job_name: prometheus-stack
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - region: eu-west-1
    refresh_interval: 1m
    port: 9090
  relabel_configs:
  - source_labels: [__meta_ec2_tag_prometheus_tier]
    separator: ;
    regex: ^p.+$
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_tag_prometheus_tier]
    separator: ;
    regex: ^(.*)$
    target_label: tier
    replacement: $1
    action: replace
- job_name: datadog
  scrape_interval: 5m
  scrape_timeout: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - ip-10-58-130-53:9091
- job_name: obs_p0_federate
  honor_labels: true
  params:
    match[]:
    - '{job=~"[a-z].*"}'
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /federate
  scheme: https
  static_configs:
  - targets:
    - prometheus-aws-eu-central-1.spt-mkt-trust-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-mkt-trust-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-observability-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-infra-sre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.cre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.blocket-insight-pro.schibsted.io:443
    - prometheus-creservices-pro.ingress.cre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.mp-pro.schibsted.io:443
    - prometheus-aws-eu-west-3.search-engineering-dev.schibsted.io:443
    - prometheus-aws-eu-west-3.search-engineering-pro.schibsted.io:443
    - prometheus-mp-mads-dev.ingress.cre-pro.schibsted.io:443
  metric_relabel_configs:
  - separator: ;
    regex: pod_template_hash
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_deployed_by
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_deployment_id
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_version
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kubernetes_pod_name
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: awsHostId
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: awsHostname
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_app_deployed_at
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_created_at
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kube_aws_coreos_com_autoscalinggroup
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kube_aws_coreos_com_launchconfiguration
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: ec2_autoscaling_group
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: label_kube_aws_coreos_com_launchconfiguration
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: label_kube_aws_coreos_com_autoscalinggroup
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: container_id
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: image_id
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: pod
    replacement: $1
    action: labeldrop
- job_name: ext_p0_federate
  honor_labels: true
  params:
    match[]:
    - '{job=~"[a-z].*"}'
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /federate
  scheme: https
  static_configs:
  - targets:
    - prometheus-aws-eu-west-1.rkt.schibsted.io:443
    - prometheus-aws-eu-west-1.ami-store.schibsted.io:443
    - delivery-prometheus-local.ingress.cre-pro.schibsted.io:443
    - prometheus.sol-osl01.finntech.no:443
    - prometheusadin.finntech.no:443
    - prometheusadout.finntech.no:443
    - prometheuscustomer.finntech.no:443
    - prometheusfrontend.finntech.no:443
    - prometheuspenger.finntech.no:443
    - prometheussearch.finntech.no:443
    - prometheusk8s.prod.finntech.no:443
    - metrics.kufar.by:443
    - prometheus-aws-eu-west-1.devrel-dev.schibsted.io:443
    - prometheus-aws-eu-west-1.search-engineering-dev.schibsted.io:443
  metric_relabel_configs:
  - separator: ;
    regex: pod_template_hash
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_deployed_by
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_deployment_id
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_version
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kubernetes_pod_name
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: awsHostId
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: awsHostname
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_app_deployed_at
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: fiaas_created_at
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kube_aws_coreos_com_autoscalinggroup
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: kube_aws_coreos_com_launchconfiguration
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: ec2_autoscaling_group
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: label_kube_aws_coreos_com_launchconfiguration
    replacement: $1
    action: labeldrop
  - separator: ;
    regex: label_kube_aws_coreos_com_autoscalinggroup
    replacement: $1
    action: labeldrop
- job_name: obs_p0_prom_metrics
  honor_labels: true
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - prometheus-aws-eu-central-1.spt-mkt-trust-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-mkt-trust-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-observability-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.spt-infra-sre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.cre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.blocket-insight-pro.schibsted.io:443
    - prometheus-creservices-pro.ingress.cre-pro.schibsted.io:443
    - prometheus-aws-eu-west-1.mp-pro.schibsted.io:443
    - prometheus-aws-eu-west-3.search-engineering-dev.schibsted.io:443
    - prometheus-aws-eu-west-3.search-engineering-pro.schibsted.io:443
    - prometheus-mp-mads-dev.ingress.cre-pro.schibsted.io:443
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p0
    action: replace
- job_name: ext_p0_prom_metrics
  honor_labels: true
  scrape_interval: 30s
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: https
  static_configs:
  - targets:
    - prometheus-aws-eu-west-1.rkt.schibsted.io:443
    - prometheus-aws-eu-west-1.ami-store.schibsted.io:443
    - delivery-prometheus-local.ingress.cre-pro.schibsted.io:443
    - prometheus.sol-osl01.finntech.no:443
    - prometheusadin.finntech.no:443
    - prometheusadout.finntech.no:443
    - prometheuscustomer.finntech.no:443
    - prometheusfrontend.finntech.no:443
    - prometheuspenger.finntech.no:443
    - prometheussearch.finntech.no:443
    - prometheusk8s.prod.finntech.no:443
    - metrics.kufar.by:443
    - prometheus-aws-eu-west-1.devrel-dev.schibsted.io:443
    - prometheus-aws-eu-west-1.search-engineering-dev.schibsted.io:443
  metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: .*
    target_label: tier
    replacement: p0
    action: replace

up metric from one of the federated targets:

(screenshot: up_2018-05-07)

scrape_samples_scraped from the same target:

(screenshot: scraped_total_2018-05-07)

Log messages on the federating instance:

level=info ts=2018-05-07T13:00:01.543596388Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1525687200000 maxt=1525694400000
level=info ts=2018-05-07T13:00:26.694997426Z caller=head.go:348 component=tsdb msg="head GC completed" duration=1.400530985s
level=info ts=2018-05-07T13:00:32.643341791Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=5.94827516s

brian-brazil commented May 7, 2018

That's a failed scrape, which is not data loss. See also https://www.robustperception.io/federation-what-is-it-good-for/


krasi-georgiev commented May 7, 2018

@brian-brazil how did you figure out that it is a missed scrape?


brian-brazil commented May 7, 2018

up is 0.


krasi-georgiev commented May 7, 2018

Thanks, I didn't notice the screenshot until now.

@semyonslepov this looks different from the original issue with the ingestion logs:
..."Error on ingesting samples that are too old or are too far into the future" num_dropped=40776

@semyonslepov did you check if the target exposes the correct timestamps?


semyonslepov commented May 8, 2018

@krasi-georgiev they do.
It does seem like a different issue at first glance, right. However, what I was pointing to is that scrapes are missed from time to time and that it correlates with the GC being in progress.
If that is considered a normal situation (and taking into account the article mentioned by @brian-brazil, which we probably missed or misunderstood before), there is not a lot to talk about, and this issue can be closed.
Thanks for the explanations anyway.


krasi-georgiev commented May 8, 2018

I am still a bit curious why this happens.
Looking at the code, ErrOutOfBounds is returned when you try to ingest metrics outside the current time window of the in-memory buffer (or the head):

if t < a.mint {
	return 0, ErrOutOfBounds
}

I just tried federation and it returns the most recently ingested sample and timestamp, so even if you scrape the main Prometheus server every 2h (like in your original config) you should still be getting a recent timestamp, which should be within the current head time range for ingestion.

The only explanation would be that the main server you are federating returns a timestamp that is too old, so it cannot be ingested into the current head.
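To make that failure mode concrete, here is a small self-contained sketch of the check quoted above (not the actual tsdb code, and the timestamps are invented): once the head has been truncated and its mint has moved forward, any sample older than mint is rejected:

package main

import (
	"errors"
	"fmt"
)

var errOutOfBounds = errors.New("out of bounds")

// toyAppender mimics only the bounds check quoted above; it is not the real
// tsdb headAppender.
type toyAppender struct {
	mint int64 // lower bound of the head's time range, in milliseconds
}

func (a *toyAppender) add(t int64) error {
	if t < a.mint {
		return errOutOfBounds
	}
	return nil
}

func main() {
	// Invented numbers: suppose the head was just truncated and now starts
	// at 2018-05-07 13:00:00 UTC.
	a := toyAppender{mint: 1525698000000}

	fmt.Println(a.add(1525698030000)) // 30s after mint: <nil> (accepted)
	fmt.Println(a.add(1525690800000)) // 2h before mint: out of bounds (dropped)
}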


semyonslepov commented May 9, 2018

I would imagine that being an issue for the first configuration with the 1h scrape interval. However, it still happens with 30s scrapes. Moreover, in this case it doesn't happen randomly; it happens at the same time as the GC runs.


krasi-georgiev commented May 9, 2018

I think dropping at GC is normal, as this is when the code verifies that the metrics have the correct timestamp sequence and drops the ones that are not within the head time range.
Still curious as to why these would be marked as OutOfBounds if the federated Prometheus returns correct timestamps.
I might be wrong, but I don't see any problem with how often you scrape the federated instance, as long as it returns recent timestamps for all metrics.
I would be interested to read some more insights on this.


krasi-georgiev commented May 15, 2018

@semyonslepov is it behaving as expected after you changed the configs?


semyonslepov commented May 15, 2018

@krasi-georgiev I didn't change the configuration; there are two different configurations on two different hosts. I just wanted to emphasize that they both have the same issue.


krasi-georgiev commented Jun 5, 2018

I am running out of ideas on this one.
@semyonslepov do you have any new info on the issue?


semyonslepov commented Jun 6, 2018

@krasi-georgiev no news, scrapes still fail, leaving gaps afterwards (it happens either together with the out-of-bounds error or without it; out-of-bounds errors can also happen separately from time to time, but they don't lead to such a big gap - the change in the OutOfBound counters is too low at those moments).


krasi-georgiev commented Jun 6, 2018

Hm, any idea how we can troubleshoot this together? With the current details it is a bit of guessing.


semyonslepov commented Jun 11, 2018

@krasi-georgiev I will try to increase the scrape timeouts and see how it goes; probably it's all about a too-heavy federation target.


semyonslepov commented Jun 18, 2018

The timeout increase didn't help (and scrape_duration_seconds stays within a reasonable range of 0-2 seconds). Just the following messages at the time of the failed scrape:

level=warn ts=2018-06-17T16:46:59.808406963Z caller=scrape.go:932 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=47427
level=warn ts=2018-06-17T16:46:59.808488624Z caller=scrape.go:697 component="scrape manager" scrape_pool=federate_lts target="http://internal-prometheu-Internal-94VCZL74ZSUN-1346284009.eu-west-1.elb.amazonaws.com:9090/federate?match%5B%5D=%7Bjob%3D~%22%5Ba-z%5D.%2A%22%2Cjob%21%3D%22federate_lts%22%7D" msg="append failed" err="out of bounds"
level=info ts=2018-06-17T17:00:00.36112907Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1529244000000 maxt=1529251200000
level=info ts=2018-06-17T17:00:18.261083468Z caller=head.go:348 component=tsdb msg="head GC completed" duration=2.878933852s
level=info ts=2018-06-17T17:00:20.346929019Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=2.085781882s

And an empty result when trying to fetch the test metric's value (we have a monitoring script for this; it queries the Prometheus API for known metrics every 5 minutes).
I don't have any ideas on how to debug it at the moment.


krasi-georgiev commented Jun 18, 2018

Maybe if you add some debugging info it will give you a clue about which samples are dropped, but the output would be quite busy, so I'm not sure if it will help:

func (a *headAppender) Add(lset labels.Labels, t int64, v float64) (uint64, error) {
	if t < a.mint {
		if lset.String() == "..........?? filter by a specific lset" {
			fmt.Println("Dropping")
			fmt.Println(lset)
			fmt.Println(t)
			fmt.Println(v)
		}
		return 0, ErrOutOfBounds
	}

Add it here, in the existing check:

func (a *headAppender) Add(lset labels.Labels, t int64, v float64) (uint64, error) {
	if t < a.mint {
		return 0, ErrOutOfBounds
	}

To compile a binary after the change:

git clone .....
cd cmd/prometheus
go build

krasi-georgiev commented Nov 28, 2018

Seems stale; feel free to reopen if you think we should revisit.
