[Metrics generator] Initial implementation #1282
Conversation
…dd scrape endpoint
# Conflicts:
#   pkg/tempopb/tempo.pb.go
Generator is now able to run an embedded OTel collector. This allows running any number of processors, including spanmetrics. This collector uses a custom metrics exporter and a remote write client. Metrics generated by the spanmetrics processor are converted to Prometheus samples and appended to the remote write client's WAL, which will export the metric data after commit.
Changes:
- Remote write was constantly failing with "out of order samples". Fix: apparently the timestamp has to be in milliseconds, not in seconds. There is not a lot of documentation on the Prometheus interfaces, but the Cortex proto has a field timestampMs.
- After this, remote write was failing with label name "le" is not unique: invalid sample. Fix: there was a bug in the append logic of the labels array. Since we were reusing the labels variable in a for-loop, the le label was added multiple times.
- Initially I thought pushing metrics on every push request was causing the out-of-order issues, so I refactored the processor a bit to have both a PushSpans and a CollectMetrics method. CollectMetrics is called every 15s with a storage.Appender. This way we can also use a single appender for multiple processors. This wasn't necessary to get it working, but I think this is the logic we want to end up with (i.e. don't send samples too often to keep DPM low).
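For illustration, a minimal sketch of the two fixes described above (millisecond timestamps and not reusing the label slice across histogram buckets), assuming the storage.Appender and labels APIs of the Prometheus version vendored by Tempo (exact import paths and signatures vary between versions); the appendLatencyBuckets helper is hypothetical, not the code in this PR:

```go
package spanmetrics

import (
	"strconv"
	"time"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

// appendLatencyBuckets appends one sample per bucket of a cumulative histogram.
// The base label set is copied for every bucket: appending to a slice that is
// reused across loop iterations is exactly how the duplicate "le" label crept in.
func appendLatencyBuckets(app storage.Appender, base []labels.Label, bounds []float64, counts []uint64) error {
	// Remote write expects millisecond timestamps; second-resolution timestamps
	// show up as "out of order samples" errors.
	ts := time.Now().UnixMilli()

	for i, bound := range bounds {
		withLE := make([]labels.Label, 0, len(base)+1)
		withLE = append(withLE, base...)
		withLE = append(withLE, labels.Label{
			Name:  "le",
			Value: strconv.FormatFloat(bound, 'f', -1, 64),
		})

		// labels.New sorts the label set, which the appender expects.
		if _, err := app.Append(0, labels.New(withLE...), ts, float64(counts[i])); err != nil {
			return err
		}
	}
	return nil
}
```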
# Conflicts:
#   go.mod
#   go.sum
#   pkg/tempopb/tempo.pb.go
#   pkg/util/test/req.go
#   vendor/modules.txt
Histogram buckets and additional dimensions can be configured
Store
- make sure tests pass
- fix copy-paste error in shouldEvictHead

Service graphs
- ensure metric names align with agent
- server latencies were overwriting client latencies
- expose operational metrics
Changes:

Service graphs processor
- replaced code to manually build metrics with CounterVec and HistogramVec
- aligned metric names with the agent
- commented out code around collectCh, this was leaking memory

Span metrics processor
- replaced code to manually build metrics with CounterVec and HistogramVec

- Introduced Registry to bridge prometheus.Registerer and storage.Appender (a rough sketch of this bridge follows below)
- Added support for configuring external labels
- Fixed generator_test.go
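A rough sketch of the idea behind the Registry bridge mentioned above: processors register ordinary client_golang collectors (CounterVec, HistogramVec), and on every collection interval the gathered samples are converted and handed to a storage.Appender for remote write. Type and method names here (Registry, CollectTo) are illustrative assumptions, not the actual Tempo code; only counters are handled to keep the sketch short:

```go
package registry

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/storage"
)

type Registry struct {
	*prometheus.Registry // implements prometheus.Registerer and prometheus.Gatherer
}

func New() *Registry {
	return &Registry{Registry: prometheus.NewRegistry()}
}

// CollectTo gathers all registered metrics and appends them as samples.
// Gauges and histogram buckets would follow the same pattern with their
// respective value fields; label collisions are ignored for brevity.
func (r *Registry) CollectTo(app storage.Appender, externalLabels map[string]string) error {
	mfs, err := r.Gather()
	if err != nil {
		return err
	}
	ts := time.Now().UnixMilli() // remote write expects millisecond timestamps

	for _, mf := range mfs {
		for _, m := range mf.GetMetric() {
			if m.GetCounter() == nil {
				continue // counters only in this sketch
			}
			lbls := []labels.Label{{Name: labels.MetricName, Value: mf.GetName()}}
			for _, lp := range m.GetLabel() {
				lbls = append(lbls, labels.Label{Name: lp.GetName(), Value: lp.GetValue()})
			}
			for name, value := range externalLabels {
				lbls = append(lbls, labels.Label{Name: name, Value: value})
			}
			if _, err := app.Append(0, labels.New(lbls...), ts, m.GetCounter().GetValue()); err != nil {
				return err
			}
		}
	}
	return app.Commit()
}
```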
Additional thoughts. Things are looking good. I will say that e2e tests would be nice to have, but not necessary for merging this PR.
I think we've addressed all review comments now, another review would be welcome 🙂 Agree adding an e2e test would be really nice; this is on our todo list. Not sure yet whether we can create one now or whether we will PR it later.
Let's go!
p.spanMetricsCallsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "traces",
should these be namespaced "traces" or "tempo"?
We are using the traces namespace here to match the metrics generated by the Grafana Agent. If we generate the same metrics, it will be easier for people to reuse queries/dashboards for both. The service graphs view in Grafana, for instance, expects metrics with the name traces_service_graph_.... We could make this configurable in Grafana, but for now we want to keep it as simple as possible.
Hmm. Yeah, I wish we could make this tempo, but I don't see a clear path to making that change. I guess it's ok to leave it as is.
func (p *processor) RegisterMetrics(reg prometheus.Registerer) error {
	labelNames := []string{"service", "span_name", "span_kind", "span_status"}
	if len(p.cfg.Dimensions) > 0 {
		labelNames = append(labelNames, p.cfg.Dimensions...)
I definitely think we need some limits here to keep a check on the number of active series, if we're going to allow custom dimensions. Either here or in appender.go, where we are appending labels/samples. Maybe for a future PR though?
Agree we need to limit the amount of series we create because this will impact the memory used by the generators. I wouldn't bother limiting it here (for now) since dimensions is controlled by the operator of Tempo, so it's easy to lower this when the cluster is getting into trouble. At this point I'm more concerned about the series created by labels like span_name, because these depend on the cardinality of the traces we are ingesting and we can't control that. Limiting the active series in the appender makes a lot of sense, both to protect Tempo and to protect the downstream TSDB. I'll add it as a TODO to this PR and will create a tracking issue with all remaining tasks once this PR is merged.
# Conflicts:
#   CHANGELOG.md
#   operations/kube-manifests/Deployment-distributor.yaml
What this PR does:
This PR contributes the first implementation of a new component: the metrics-generator. This implementation is based on the design proposal: metrics generator
The metrics-generator is a new optional component that sits in the write path and derives metrics from traces. If present, the distributor will write received spans to both the ingester and the metrics-generator. The metrics-generator processes spans and writes metrics to a Prometheus datasource using the Prometheus remote write protocol.
You can see an architectural diagram below:
Initially, the metrics-generator will be able to derive two different sets of metrics: span-metrics and service graphs.
We encourage everyone to take a look. All feedback is highly appreciated. Feel free to leave it in this PR, or ping us (@kvrhdn and @mapno) on Slack.
Known issues / TODOs
A list of open tasks that we will tackle after this PR is merged and before the next Tempo release. Once this PR is merged we can create dedicated issues for them.
The distributor spawns a new goroutine for every push; we should optimize this, for instance by introducing a queue (a rough sketch of such a queue follows this list). We want a way to limit the number of pending requests to the metrics-generator in case it is having issues and requests start backing up.
When generating metrics, we should take the timestamp of the received span into account. Spans can take a long time to reach Tempo (requests can be retried multiple times, etc.), and the timestamp could even be in the future. We should either figure out a way to correct the metrics that have already been sent or refuse spans that are too old or too new.
We should add controls to limit the amount of series we generate/remote write.
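As a rough illustration of the first item in the list above, here is a hedged sketch of a bounded push queue in the distributor: pushes to the metrics-generator are enqueued instead of each spawning a goroutine, and work is rejected when the queue is full. All names (generatorQueue, ErrGeneratorQueueFull, the worker count and capacity) are hypothetical and not part of this PR:

```go
package distributor

import "errors"

var ErrGeneratorQueueFull = errors.New("metrics-generator queue is full")

type generatorQueue struct {
	jobs chan func()
}

// newGeneratorQueue starts `workers` goroutines draining a queue that holds
// at most `capacity` pending pushes.
func newGeneratorQueue(workers, capacity int) *generatorQueue {
	q := &generatorQueue{jobs: make(chan func(), capacity)}
	for i := 0; i < workers; i++ {
		go func() {
			for job := range q.jobs {
				job()
			}
		}()
	}
	return q
}

// enqueue returns an error instead of blocking when the queue is full, so a
// struggling metrics-generator cannot back requests up into the write path.
func (q *generatorQueue) enqueue(job func()) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrGeneratorQueueFull
	}
}
```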
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]