Tempo metrics-generator stops pushing metrics via remote_write to Prometheus after some time #2514
Comments
This is interesting. Can you reproduce this by purposefully forcing your Prometheus pod to move?
Does the metrics-generator log anything? I'm also surprised we don't have an obvious metric on failures. Does this counter increase:
It seems like we need a PR to include some useful logging in the processors. Do you have any logs on the remote-write side that might help narrow it down? Based on the logs above, it seems that the series are still being generated.
Do you know what version of Tempo you're running? Can you try the tip of main? This PR correctly handles sharding when doing RW. I'm wondering if the metrics-generator is falling behind (b/c it's not sharding) and Prom is rejecting the samples b/c they're too old. Also, the metrics-generator publishes Prometheus remote write metrics that start with the prefix below. This includes total written bytes, failures, exemplars, etc.
Perhaps dig into these and see if there's a clue on what is causing the failures.
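For reference, a minimal sketch of watching those remote-write metrics from the alerting side, assuming the metrics-generator exposes the standard Prometheus remote-storage client counters (e.g. `prometheus_remote_storage_samples_failed_total`) and a `job="tempo-metrics-generator"` label — both are assumptions here, so adjust the names to whatever your /metrics endpoint actually exposes:

```yaml
# Hypothetical Prometheus alerting rule. The metric name and job label assume
# the standard Prometheus remote-storage client instrumentation; the prefix
# exposed by your Tempo version may differ.
groups:
  - name: tempo-metrics-generator-remote-write
    rules:
      - alert: MetricsGeneratorRemoteWriteFailing
        # Fire when the failure counter keeps increasing for 10 minutes.
        expr: rate(prometheus_remote_storage_samples_failed_total{job="tempo-metrics-generator"}[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tempo metrics-generator remote write is failing samples"
```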
I'm using the Tempo (distributed) Helm chart. I'll try the tip of main overnight (by changing the tag/image used in the chart).
EDIT: I do have those metrics - before trying the new image, I'll wait for the issue to re-occur and check the metrics first.
Some more container logs that may help: Log snippet
As there's no way to override the image tag globally, I'll update just the metricsGenerator image.
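For reference, a rough values.yaml sketch of the kind of per-component override described here, assuming the tempo-distributed chart exposes a `metricsGenerator.image` block; the exact field names vary by chart version, and the tag shown is a placeholder:

```yaml
# Hypothetical values.yaml override for the tempo-distributed chart.
# Field names may differ between chart versions; the tag below stands in
# for whatever tip-of-main build you want to test.
metricsGenerator:
  image:
    repository: grafana/tempo
    tag: main-<commit-sha>
```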
Unfortunately that commit also ships this, and the Helm chart is not updated to handle it, so when I use the new image/tag just for metricsGenerator it goes into a crashloop with the following error:
Same with the tip of
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
This issue still plagues our systems. We're running two replicas of metrics-generator now and still haven't found a plausible cause for it - after some time the metrics just stop appearing in Prometheus, and there are no error logs or any indication on the metrics-generator side that something is wrong. 😢
Reopening and labeling keepalive. Are you seeing this on Tempo latest: 2.2.1?
I'm seeing it in Tempo 2.2.0.
There will likely be no changes. There were some remote write improvements in 2.2.0, but the 4 small patches in 2.2.1 will unfortunately not fix your issue. So to review:
Some things to try/think about:
I'm sorry you are seeing this issue. We do not see it internally, which makes it tough to debug.
Thank you for the writeup!
@erplsf, did this resolve for you? I'm still facing the same issue.
So we do not see this issue ourselves, and we'll need more help to diagnose it. Can you check the relevant metrics/logs on the node to see if any resource saturation correlates with the failure? Disk issues, OOMed pods, syslog errors, etc.?
My issue is resolved. I want to report back for folks who run into the same issue. First of all, I had forgotten to activate metrics scraping from Tempo itself. After activating the metric scraping and following the guidance above, I also checked the systemd logs. Looking at the OS timestamps from the spans' producers, I see a clock skew of more than 40s. Increasing the value of metrics_ingestion_time_range_slack to 60s solves my problem. Why the NTP clocks have this deviation is, of course, the customer's responsibility.
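For reference, a minimal sketch of that override, assuming `metrics_ingestion_time_range_slack` sits directly under the top-level `metrics_generator` block as in the Tempo configuration reference:

```yaml
# Tempo configuration sketch: widen the ingestion slack so spans from
# clock-skewed producers are still accepted by the metrics-generator.
# Verify the default (documented as 30s) for your Tempo version.
metrics_generator:
  metrics_ingestion_time_range_slack: 60s
```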
@suikast42, when you say you had forgotten to activate metrics scraping from Tempo itself, what do you have in mind? Is it:
Thanks!
Describe the bug
I use a k8s service (ClusterIP) as a target for the metrics-generator remote_write, e.g.:
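A rough sketch of such a target, assuming the standard `metrics_generator.storage.remote_write` block; the service name, namespace, and port are hypothetical placeholders:

```yaml
# Sketch of the kind of ClusterIP target described, not an exact config.
metrics_generator:
  storage:
    remote_write:
      # Hypothetical in-cluster Prometheus service; adjust name/namespace/port.
      - url: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
```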
After some time (I believe this is related to Prometheus pods being relocated to different nodes or restarted), the metrics-generator stops shipping new metrics to Prometheus - no more data comes in. Here is an example of one of the metrics queried in Prometheus:
And the distributor/ingestion pipeline is still working correctly - I can search, filter, and do all the usual operations with traces - but the service graph and the metrics generated from the traces are lost.
Graph showing the ingester processing/appending new traces during that period:
And one more graph showing that the distributor does, in fact, send data to the metrics-generator:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected metrics-generator to continue pushing metrics.
Environment:
Additional Context
I couldn't find any metrics generated by the metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like `remote_writes_failed_total` would make sense here?
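Until such a counter exists, one way to at least catch the symptom is to alert on the generated series going stale; a sketch assuming the span-metrics processor's `traces_spanmetrics_calls_total` output (swap in whichever generated metric you actually rely on):

```yaml
# Hypothetical alert on the symptom rather than the cause: fire when the
# generated span metrics stop arriving in Prometheus entirely.
groups:
  - name: tempo-generated-metrics-staleness
    rules:
      - alert: TempoGeneratedMetricsStale
        expr: absent_over_time(traces_spanmetrics_calls_total[15m])
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No span metrics received from the Tempo metrics-generator in 15m"
```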