
Tempo metrics-generator stops pushing metrics via remote_write to Prometheus after some time #2514

Open
erplsf opened this issue May 30, 2023 · 23 comments
Labels
keepalive Label to exempt Issues / PRs from stale workflow

Comments

@erplsf

erplsf commented May 30, 2023

Describe the bug
I use a k8s Service (ClusterIP) as the target for metrics-generator remote_write, e.g.:

storage:
    path: /var/tempo/wal
    remote_write:
    - url: http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write
    remote_write_flush_deadline: 1m
    wal: null

After some time (I believe this is related to Prometheus pods being relocated to different nodes or restarted), metrics-generator stops shipping new metrics to Prometheus - no more data comes in. Example of one of the metrics queried in Prometheus:
[screenshot: 2023-05-30_12:53:09]

The distributor/ingestion pipeline is still working correctly - I can search, filter, and do all the usual operations with traces - but the service graph and the metrics generated from traces are lost.
Graph showing the ingester processing/appending new traces during that period:
[screenshot: 2023-05-30_13:02:43]

And one more graph showing that the distributor does, in fact, send data to metrics-generator:
[screenshot: 2023-05-30_13:07:36]

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo (SHA or version) with metrics-generator enabled
  2. Configure the remote_write endpoint
  3. Perform Operations (Read/Write/Others)
  4. Observe metrics not being written

Expected behavior

I expected metrics-generator to continue pushing metrics.

Environment:

  • Infrastructure: AWS EKS, k8s rev: v1.22.17-eks-0a21954
  • Deployment tool: helm (via Terraform)

Additional Context

I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?

@joe-elliott
Member

After some time (I believe this is related to Prometheus pods being relocated to different nodes or restarted), metrics-generator stops shipping new metrics to Prometheus

This is interesting. Can you reproduce this by purposefully forcing your Prometheus pod to move?

I couldn't find any metrics generated by metrics-generator itself which could help me debug and notice this issue, so maybe a new metric like remote_writes_failed_total would make sense here?

Does the metrics generator log anything? I'm also surprised we don't have an obvious metric on failures. Does this counter increase: tempo_metrics_generator_registry_collections_failed_total?

@erplsf
Author

erplsf commented May 30, 2023

I'll redeploy metrics-generator, so it "acquires" Prometheus again, and then I will try to reproduce it.

No, the metrics generator doesn't seem to be logging anything interesting, so here are screenshots of the last period of its running:
[screenshot: 2023-05-30_15:52:08]

And the graph for the counter (even without rate, the raw counter never goes above zero):
[screenshot: 2023-05-30_15:52:39]

@zalegrala
Contributor

It seems like we need a PR to include some useful logging in the processors. Do you have any logs on the remote-write side that might help narrow it down? Based on the logs above, it seems that the series are still being generated.

@erplsf
Author

erplsf commented May 31, 2023

Nothing interesting seems to happen on the Prometheus (receiver) side - I don't think it logs anything related to remote writes at all by default.

As for testing / trying to reproduce this:

  1. I've restarted the metrics-generator, and it replayed its WAL, so now the graph is filled with the missing data:
    [screenshot: 2023-05-31_10:43:07]
    [screenshot: 2023-05-31_10:42:29]
  2. Restarting Prometheus created a small gap in the metrics, but after both Prometheus pods (we run 2 replicas) came up, metrics-generator started writing again, although without filling in the gap. So I guess the issue isn't clearly related to Prometheus restarts / pod IP changes.
    [screenshot: 2023-05-31_10:45:21]

Any more ideas what I can do here? This is a pretty major blocker for us in evaluating Tempo, since with this bug present we may need to resort to hacks such as periodically restarting metrics-generator so that our dashboards keep working / can be navigated correctly.

Meanwhile, I'll try to keep an eye on this metric and catch when/how this bug manifests itself again.

@joe-elliott
Member

Do you know what version of Tempo you're running? Can you try tip of main grafana/tempo:main-1d84273? We merged this PR which is not in 2.1:

#2463

This PR correctly handles sharding when doing RW. I'm wondering if the metrics generator is falling behind (b/c it's not sharding) and prom is rejecting the samples b/c they're too old.

Also, the metrics generator publishes prometheus remote write metrics that start with the prefix below. This includes total written bytes, failures, exemplars, etc.

prometheus_remote_storage...

Perhaps dig into these and see if there's a clue on what is causing the failures.
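If it helps, here is a minimal sketch of Prometheus alert rules over a few of those metrics. The metric names are the standard Prometheus remote-write queue metrics that the generator re-exports; the group name, rule names, and thresholds are purely illustrative:

groups:
  - name: tempo-metrics-generator-remote-write   # illustrative group name
    rules:
      - alert: TempoRemoteWriteSamplesFailing
        # samples that failed to send with a non-recoverable error
        expr: sum(rate(prometheus_remote_storage_samples_failed_total[5m])) > 0
      - alert: TempoRemoteWriteSamplesDropped
        # samples read from the WAL but dropped before sending
        expr: sum(rate(prometheus_remote_storage_samples_dropped_total[5m])) > 0
      - alert: TempoRemoteWriteShardsMaxedOut
        # the queue wants more shards than it is allowed to run, i.e. it cannot keep up
        expr: prometheus_remote_storage_shards_desired > prometheus_remote_storage_shards_max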

@erplsf
Author

erplsf commented May 31, 2023

I'm using the Tempo (distributed) Helm chart v1.4.1, which AFAIK maps to Tempo version 2.1.1.

I'll try using the tip of main overnight (by changing the image/tag used in the chart).

Weird, I'm scraping the metrics-generator endpoints, but I don't have any prometheus_remote_storage metric in my instance.

EDIT: I do have those metrics - before trying the new image, I'll wait for the issue to re-occur and check the metrics first.

@erplsf
Author

erplsf commented May 31, 2023

metrics-generator stopped writing metrics at around 22:00 - the pod was restarted (evicted by Karpenter due to rebalancing). The only interesting metric I could find that had some values is prometheus_remote_storage_enqueue_retries_total, but it seems it has been steadily increasing since the beginning of the Tempo deployment:
[screenshot: 2023-05-31_22:37:05]

Both "failed" metrics are stable at zero:
[screenshot: 2023-05-31_22:38:48]
[screenshot: 2023-05-31_22:38:24]

For now I'll restart the pods again as is, hoping the WAL gets replayed, and will check the captured logs from the pods - maybe they say something more.

And in the morning I'll try the new image/tag.

@erplsf
Author

erplsf commented May 31, 2023

Some more container logs that may help:

Log snippet
level=info ts=2023-05-31T20:49:01.514585942Z caller=main.go:215 msg="initialising OpenTracing tracer"
level=info ts=2023-05-31T20:49:01.529641413Z caller=main.go:102 msg="Starting Tempo" version="(version=2.1.1, branch=HEAD, revision=4157d7620)"
level=info ts=2023-05-31T20:49:01.531177797Z caller=server.go:334 http=:3100 grpc=:9095 msg="server listening on addresses"
level=info ts=2023-05-31T20:49:01.640087Z caller=memberlist_client.go:437 msg="Using memberlist cluster label and node name" cluster_label= node=tempo-metrics-generator-5c675696bb-v69kz-580085e2
level=info ts=2023-05-31T20:49:01.640467134Z caller=module_service.go:82 msg=initialising module=internal-server
level=info ts=2023-05-31T20:49:01.640557809Z caller=module_service.go:82 msg=initialising module=server
level=info ts=2023-05-31T20:49:01.640896156Z caller=module_service.go:82 msg=initialising module=memberlist-kv
level=info ts=2023-05-31T20:49:01.640912915Z caller=module_service.go:82 msg=initialising module=overrides
level=info ts=2023-05-31T20:49:01.641315183Z caller=module_service.go:82 msg=initialising module=usage-report
level=info ts=2023-05-31T20:49:01.641723235Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2023-05-31T20:49:01.641822883Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2023-05-31T20:49:01.665417683Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=7 elapsed_time=23.696354ms
level=info ts=2023-05-31T20:49:01.665464707Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=tempo-gossip-ring
level=info ts=2023-05-31T20:49:01.665532522Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=tempo-metrics-generator-5c675696bb-v69kz ring=metrics-generator
level=info ts=2023-05-31T20:49:01.665790893Z caller=app.go:188 msg="Tempo started"
level=info ts=2023-05-31T20:49:01.680562849Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=7 elapsed_time=15.098448ms
level=info ts=2023-05-31T20:49:01.925215784Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/var/tempo/wal/single-tenant
ts=2023-05-31T20:49:01.925941912Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=6e4ccb
ts=2023-05-31T20:49:01.925972432Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2023-05-31T20:49:01.92606343Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=6e4ccb
ts=2023-05-31T20:49:01.926171172Z caller=dedupe.go:112 tenant=single-tenant component=remote level=error remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="error tailing WAL" err="failed to find segment for index"
level=info ts=2023-05-31T20:49:01.929552853Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/var/tempo/wal/single-tenant/wal
level=info ts=2023-05-31T20:49:01.929725837Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2023-05-31T20:49:16.949711425Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=7754
ts=2023-05-31T20:49:16.966710295Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=6e4ccb url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=10.04001782s
level=info ts=2023-05-31T20:49:31.939344628Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=8162
level=info ts=2023-05-31T20:49:46.941825334Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=8418
level=info ts=2023-05-31T20:50:01.943571598Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=9788
level=info ts=2023-05-31T20:50:16.949477333Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=12975
level=info ts=2023-05-31T20:50:31.956189483Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19158
level=info ts=2023-05-31T20:50:46.953215883Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19312
level=info ts=2023-05-31T20:51:01.951997328Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=19409

@erplsf
Author

erplsf commented Jun 1, 2023

As there's no way to override the image tag globally, I'll update just the metrics-generator to the suggested tag and report in a day whether this issue re-occurs.

@erplsf
Author

erplsf commented Jun 1, 2023

Unfortunately that commit also ships this, and the Helm chart is not updated to handle it, so when I use the new image/tag just for metricsGenerator it goes into a crash loop with the following error:

failed parsing config: failed to parse configFile /conf/tempo.yaml: yaml: unmarshal errors:
  line 109: field tolerate_failed_blocks not found in type frontend.Config

Same with the tip of main - 890838b.
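For anyone hitting the same parse error: the field was removed on main, so a workaround (assuming the chart renders it under query_frontend, as the error message suggests) is to drop that line from the rendered config, roughly:

query_frontend:
  # tolerate_failed_blocks: 0   # no longer recognized on main - remove this line when running a main-branch image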

@github-actions
Contributor

github-actions bot commented Aug 1, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale label (Used for stale issues / PRs) on Aug 1, 2023
@github-actions github-actions bot closed this as not planned (Won't fix, can't repro, duplicate, stale) on Aug 17, 2023
@erplsf
Author

erplsf commented Aug 23, 2023

This issue still plagues our systems. We're running two replicas of metrics-generator now and still haven't found a plausible cause for it - after some time the metrics just stop appearing in Prometheus, and there are no error logs or any indication on the metrics-generator side that something is wrong. 😢

@joe-elliott
Member

Reopening and labeling keepalive. Are you seeing this on Tempo latest: 2.2.1?

@joe-elliott joe-elliott reopened this Aug 23, 2023
@joe-elliott joe-elliott added the keepalive label (exempt Issues / PRs from stale workflow) and removed the stale label (Used for stale issues / PRs) on Aug 23, 2023
@erplsf
Author

erplsf commented Aug 23, 2023

I'm seeing it in Tempo 2.2.0.
I can update today and report if we'll continue to observe this in 2.2.1.

@joe-elliott
Member

I can update today and report if we'll continue to observe this in 2.2.1.

There will likely be no changes. There were some remote write improvements in 2.2.0, but the 4 small patches in 2.2.1 will unfortunately not fix your issue.

So to review:

  • Prometheus stops receiving metrics from Tempo metric generators after X amount of time
  • As far as you can tell there is no instability in the generators or in Prometheus
  • There are no interesting log lines in the generators or Prometheus

Some things to try/think about:

  • Since this issue was filed I have written up a pretty extensive page on troubleshooting metrics-generator issues:
    https://grafana.com/docs/tempo/next/troubleshooting/metrics-generator/
    Perhaps review these metrics?
  • Is the amount of time between restarting the generators and the metrics going missing variable, or constant?
  • Is it possible Prometheus is dropping the metrics? I'm not familiar with debugging from the prom side, but maybe there's some kind of "max series" setting over there? (See the sketch further down for a couple of receiver-side checks.)

I'm sorry you are seeing this issue. We do not see it internally which makes it tough to debug.
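On the last point, a hedged sketch of checks you could point at the receiving Prometheus to see whether it is rejecting what the generator sends (the handler label value and metric names below are what recent Prometheus versions expose; adjust to your setup):

groups:
  - name: remote-write-receiver-checks   # illustrative group name
    rules:
      - alert: RemoteWriteRequestsErroring
        # non-2xx responses on the remote-write receive endpoint
        expr: 'sum(rate(prometheus_http_requests_total{handler="/api/v1/write", code!~"2.."}[5m])) > 0'
      - alert: OutOfOrderSamplesRejected
        # samples refused because their timestamps are out of order / too old
        expr: sum(rate(prometheus_tsdb_out_of_order_samples_total[5m])) > 0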

@erplsf
Author

erplsf commented Aug 24, 2023

Thank you for the writeup!
(Un)fortunately I'm leaving on vacation tomorrow, but I'll set up a reminder and will get back to you when I've gathered enough information to hopefully make this easier to debug.

@evilgenius-jp

@erplsf did this get resolved for you? I still keep facing the same issue.

@suikast42

suikast42 commented Jun 3, 2024

Same here.

I am on the latest LGTM version in my dev env and in production.

Suddenly, overnight, metrics-generator stopped producing span metrics in the production environment.

I have exactly the same configuration in dev and prod (the single difference is the storage duration: 1d -> 30d).

The whole LGTM setup is a single deployment with local file storage.

I can't see any errors in Tempo or Mimir. The apps can push traces and metrics, but the metrics-generator won't produce metrics from the traces.

Development Env:

[screenshot]

Production Env:
[screenshot]

@suikast42

suikast42 commented Jun 4, 2024

Update:

Somehow the metrics generator worked for a few hours during the night and then stopped again:
[screenshot]

@joe-elliott
Member

So we do not see this issue and we'll need more help to diagnose. Can you check relevant metrics/logs on the node to see if any resources are saturated that correlate with the failure? disk issues? OOMed pods? syslog errors? etc

@suikast42

So we do not see this issue and we'll need more help to diagnose. Can you check relevant metrics/logs on the node to see if any resources are saturated that correlate with the failure? disk issues? OOMed pods? syslog errors? etc

I have checked all of this.

The logs say there is nothing to do:

level=info ts=2024-06-04T15:23:22.388783918Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:37.388506404Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:23:52.389075694Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:07.388806274Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:22.389042666Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:37.389053244Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:24:52.388918437Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:07.388334347Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:22.388419881Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:37.388802937Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:25:52.388655141Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:07.388244935Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:22.389210792Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:37.388696048Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:26:52.388249678Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:07.388179911Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:22.388147373Z caller=registry.go:257 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-06-04T15:27:22.388251415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:37.388463094Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:27:52.388821661Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:07.388782982Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:17.156958422Z caller=poller.go:241 msg="writing tenant index" tenant=single-tenant metas=2 compactedMetas=2
level=info ts=2024-06-04T15:28:17.158943884Z caller=poller.go:136 msg="blocklist poll complete" seconds=0.002474755
level=info ts=2024-06-04T15:28:22.388833133Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:37.388965088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:28:52.38916509Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:07.388431511Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:22.388827088Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:37.388112415Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:29:52.388795907Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-06-04T15:30:07.38924223Z caller=registry.go:236 tenant=single-tenant msg="collecting metrics" active_series=0

The traces, on the other hand, are there:

[screenshot]

How can I dig deeper into this issue?

@suikast42

My issue is resolved. I want to report it for folks who run into the same issue.

First of all, I had forgotten to activate metrics scraping from Tempo itself. After activating metric scraping and following the
Troubleshoot metrics-generator docs, I see the metric shown below.

[screenshot]

Furthermore, in the systemd logs I see:

[screenshot]

After looking at the OS timestamps of the spans' producers, I see a clock skew of more than 40s.

Increasing the value of metrics_ingestion_time_range_slack to 60s solves my problem.
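For reference, a minimal sketch of where that option lives in the Tempo config (assuming the top-level metrics_generator block of recent Tempo versions; 60s is simply the value that covered the observed skew):

metrics_generator:
  # spans whose timestamps fall outside this window around "now" are not turned into metrics
  metrics_ingestion_time_range_slack: 60s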

Why the NTP clocks have this deviation is, of course, the customer's responsibility.

@nc-gcz

nc-gcz commented Jul 12, 2024

@suikast42, when you refer to "First of all, I had forgotten to activate metrics scraping from Tempo itself":

What do you have in mind? Is it:

  • Setting up Prometheus to scrape from <component>:3200/metrics endpoint(s) from components?
  • Or is it some specific Grafana Tempo configuration option you are referring to?

Thanks!
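For what it's worth, the first option would look roughly like the sketch below (the job name and target address are illustrative; Tempo serves Prometheus metrics on its HTTP port, 3200 by default, under /metrics):

scrape_configs:
  - job_name: tempo   # illustrative job name
    static_configs:
      - targets:
          # assumed in-cluster address of the metrics-generator service; adjust to your deployment
          - tempo-metrics-generator.tempo.svc.cluster.local:3200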
