[[outputs.http]] prometheus remote write metrics are not receiving on Thanos "unsupported value type" #12830

Closed
Ishmeet opened this issue Mar 10, 2023 · 5 comments
Labels: bug (unexpected problem or unintended behavior), waiting for response (waiting for response from contributor)

Ishmeet commented Mar 10, 2023

Relevant telegraf.conf

[[outputs.http]]
      url = "http://thanos-receive.thanos2:19291/api/v1/receive"
      timeout = "60s"
      method = "POST"
      data_format = "prometheusremotewrite"
      #data_format = "prometheus"
      insecure_skip_verify = true
      use_batch_format = true
      content_encoding = "snappy"
      non_retryable_statuscodes = [409, 413]   # note: this does not have any effect
      [outputs.http.headers]
        cluster_name = "telegraf"
        Content-Type = "application/x-protobuf"
        Content-Encoding = "snappy"
        X-Prometheus-Remote-Write-Version = "0.1.2"

[[inputs.prometheus]]
      urls = [
        "http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics"
      ]
      data_format = "prometheusremotewrite"
      [[inputs.prometheus.tags]]
        cluster = "telegraf"

Logs from Telegraf

[centos@k8node01 ~]$ kubectl logs my-release-telegraf-595c4bc7cf-5rnjm -n default
2023-03-10T03:34:23Z I! Using config file: /etc/telegraf/telegraf.conf
2023-03-10T03:34:23Z I! Starting Telegraf 1.25.3
2023-03-10T03:34:23Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-03-10T03:34:23Z I! Loaded inputs: prometheus
2023-03-10T03:34:23Z I! Loaded aggregators: 
2023-03-10T03:34:23Z I! Loaded processors: enum
2023-03-10T03:34:23Z I! Loaded secretstores: 
2023-03-10T03:34:23Z I! Loaded outputs: http
2023-03-10T03:34:23Z I! Tags enabled: host=telegraf-polling-service
2023-03-10T03:34:23Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:10s
2023-03-10T03:34:23Z D! [agent] Initializing plugins
2023-03-10T03:34:23Z D! [agent] Connecting outputs
2023-03-10T03:34:23Z D! [agent] Attempting connection to [outputs.http]
2023-03-10T03:34:23Z D! [agent] Successfully connected to outputs.http
2023-03-10T03:34:23Z D! [agent] Starting service inputs
2023-03-10T03:34:30Z D! [outputs.http] Wrote batch of 1000 metrics in 35.28967ms
2023-03-10T03:34:30Z D! [outputs.http] Buffer fullness: 6719 / 10000 metrics
2023-03-10T03:34:30Z D! [outputs.http] Wrote batch of 1000 metrics in 12.434578ms
2023-03-10T03:34:30Z D! [outputs.http] Buffer fullness: 6188 / 10000 metrics
2023-03-10T03:34:30Z D! [outputs.http] Wrote batch of 1000 metrics in 7.152857ms
2023-03-10T03:34:30Z D! [outputs.http] Buffer fullness: 5188 / 10000 metrics
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 1000 metrics in 12.945771ms
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 1000 metrics in 7.292723ms
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 1000 metrics in 7.379668ms
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 1000 metrics in 7.396781ms
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 1000 metrics in 7.302257ms
2023-03-10T03:34:33Z D! [outputs.http] Wrote batch of 188 metrics in 2.570767ms
2023-03-10T03:34:33Z D! [outputs.http] Buffer fullness: 0 / 10000 metrics
2023-03-10T03:34:40Z D! [outputs.http] Wrote batch of 1000 metrics in 12.181248ms
2023-03-10T03:34:40Z D! [outputs.http] Buffer fullness: 3408 / 10000 metrics
2023-03-10T03:34:40Z D! [outputs.http] Wrote batch of 1000 metrics in 9.080564ms
2023-03-10T03:34:40Z D! [outputs.http] Buffer fullness: 3476 / 10000 metrics
2023-03-10T03:34:40Z D! [outputs.http] Wrote batch of 1000 metrics in 21.454808ms
2023-03-10T03:34:40Z D! [outputs.http] Buffer fullness: 5189 / 10000 metrics
2023-03-10T03:34:40Z D! [outputs.http] Wrote batch of 1000 metrics in 9.879934ms
2023-03-10T03:34:40Z D! [outputs.http] Buffer fullness: 4189 / 10000 metrics
2023-03-10T03:34:43Z D! [outputs.http] Wrote batch of 1000 metrics in 9.991816ms
2023-03-10T03:34:43Z D! [outputs.http] Wrote batch of 1000 metrics in 9.819117ms
2023-03-10T03:34:43Z D! [outputs.http] Wrote batch of 1000 metrics in 10.283911ms
2023-03-10T03:34:43Z D! [outputs.http] Wrote batch of 1000 metrics in 15.871428ms
2023-03-10T03:34:43Z D! [outputs.http] Wrote batch of 189 metrics in 2.509724ms
2023-03-10T03:34:43Z D! [outputs.http] Buffer fullness: 0 / 10000 metrics

System info

telegraf:1.25-alpine

Docker

No response

Steps to reproduce

  1. Configure inputs.prometheus to read from kube-state-metrics.
  2. Configure outputs.http with data_format = "prometheusremotewrite" and the Thanos receive URL.
  3. Check on the Thanos receiver.
    ...

Expected behavior

Metrics should have been received by Thanos, or Telegraf should have reported an error or warning.

Actual behavior

No errors on Thanos or Telegraf, but the metrics are still not received by Thanos.

Additional info

Error logs on Thanos

level=warn ts=2023-03-10T07:27:14.137749483Z caller=writer.go:131 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=206
level=debug ts=2023-03-10T07:33:34.170158185Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_deployment_status_condition_gauge\", condition=\"Progressing\", deployment=\"my-release-telegraf\", host=\"telegraf-polling-service\", namespace=\"default\", status=\"false\", url=\"http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics\"}" sample="unsupported value type"
Ishmeet added the bug label on Mar 10, 2023
powersj (Contributor) commented Mar 10, 2023

> No errors on Thanos or Telegraf, but the metrics are still not received by Thanos.

Why do you think this is an issue in Telegraf?

Telegraf is clearly getting a valid 2xx return code back from your HTTP endpoint; otherwise it would error out instead of claiming successful writes.

If you can provide additional logs or information that point to an issue in Telegraf, we would be very happy to help resolve any issues. However, without additional information (e.g. debug logs from Thanos showing issues with the request), it is not clear where to take this report.
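
For reference, a minimal sketch of a side output that shows what Telegraf is serializing, using the stock file output plugin (the settings here are illustrative only, not part of the reported setup):

    # Write the same metrics to stdout in plain Prometheus text format so that
    # duplicate series or unexpected timestamps can be inspected directly.
    [[outputs.file]]
      files = ["stdout"]
      data_format = "prometheus"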

powersj added the waiting for response label on Mar 10, 2023
Ishmeet (Author) commented Mar 10, 2023

> No errors on Thanos or Telegraf, but the metrics are still not received by Thanos.
>
> Why do you think this is an issue in Telegraf?
>
> Telegraf is clearly getting a valid 2xx return code back from your HTTP endpoint; otherwise it would error out instead of claiming successful writes.
>
> If you can provide additional logs or information that point to an issue in Telegraf, we would be very happy to help resolve any issues. However, without additional information (e.g. debug logs from Thanos showing issues with the request), it is not clear where to take this report.

level=warn ts=2023-03-10T07:27:14.137749483Z caller=writer.go:131 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" numDropped=206

More logs

level=debug ts=2023-03-10T07:33:34.170158185Z caller=writer.go:88 component=receive component=receive-writer msg="Out of order sample" lset="{__name__=\"kube_deployment_status_condition_gauge\", condition=\"Progressing\", deployment=\"my-release-telegraf\", host=\"telegraf-polling-service\", namespace=\"default\", status=\"false\", url=\"http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics\"}" sample="unsupported value type"

Input plugin:

    [[inputs.prometheus]]
      urls = [
        "http://kube-state-metrics.kube-system.svc.cluster.local:8080/metrics"
      ]
      data_format = "prometheusremotewrite"
      [[inputs.prometheus.tags]]
        cluster = "telegraf"

telegraf-tiger bot removed the waiting for response label on Mar 10, 2023
Ishmeet changed the title: [[outputs.http]] prometheus remote write metrics are not receiving on Thanos → [[outputs.http]] prometheus remote write metrics are not receiving on Thanos "unsupported value type" (Mar 10, 2023)
powersj (Contributor) commented Mar 10, 2023

> "Error on ingesting out-of-order samples" numDropped=206

This has come up before with this serializer. In general metrics are not ordered in Telegraf. Let me chat with @srebhan and get back to you.

powersj (Contributor) commented Mar 13, 2023

Hi,

We chatted about this issue a bit more today. While we could possibly order individual batches, we ultimately cannot order all of the metrics you might send. Depending on the situation:

  1. You might have metrics in the buffer that get split up across batches with different times.
  2. You could push data from different inputs with newer timestamps and run into this as well.
  3. Your inputs could simply have incorrect timestamps set; see thanos-io/thanos#4831 for a longer discussion.

Reading through https://thanos.io/tip/operating/troubleshooting.md/#out-of-order-samples-error, the key things seem to be making sure the set of labels used on the metrics is unique for each deployment and avoiding duplicate metrics.
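
As an illustration of the label-uniqueness point, a minimal sketch of a per-deployment tag (the tag name and value are assumptions made for the example, not something required by Telegraf or Thanos):

    # Give each Telegraf deployment its own identifying tag so the series it
    # writes to Thanos receive do not collide with series from another replica.
    [global_tags]
      telegraf_replica = "cluster-a"   # hypothetical value, unique per deployment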

It is not clear from this what Telegraf could further do to aid in resolving this issue, nor does there seem to be a single change or fix for it.

powersj added the waiting for response label on Mar 13, 2023
telegraf-tiger bot commented

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!
