
[Metrics generator] Initial implementation #1282

Merged

mapno merged 82 commits into main from metrics-generator-dev on Feb 21, 2022

Conversation

@mapno (Member) commented on Feb 11, 2022

What this PR does:

This PR contributes the first implementation of a new component, the metrics-generator. The implementation is based on the design proposal: metrics generator

The metrics-generator is a new, optional component that sits in the write path and derives metrics from traces. If present, the distributor writes received spans to both the ingester and the metrics-generator. The metrics-generator processes spans and writes metrics to a Prometheus data source using the Prometheus remote write protocol.

You can see an architectural diagram below:

                                                                      │
                                                                      │
                                                                   Ingress
                                                                      │
                                                                      ▼
                                                          ┌──────────────────────┐
                                                          │                      │
                                                          │     Distributor      │
                                                          │                      │
                                                          └──────────────────────┘
                                                                    2│ │1
                                                                     │ │
                                                  ┌──────────────────┘ └────────┐
                                                  │                             │
                                                  ▼                             ▼
┌ ─ ─ ─ ─ ─ ─ ─ ─                     ┏━━━━━━━━━━━━━━━━━━━━━━┓      ┌──────────────────────┐
                 │                    ┃                      ┃      │                      │
│   Prometheus    ◀────Prometheus ────┃  Metrics-generator   ┃      │       Ingester       │◀───Queries────
                 │    Remote Write    ┃                      ┃      │                      │
└ ─ ─ ─ ─ ─ ─ ─ ─                     ┗━━━━━━━━━━━━━━━━━━━━━━┛      └──────────────────────┘
                                                                                │
                                                                                │
                                                                                │
                                                                                ▼
                                                                       ┌─────────────────┐
                                                                       │                 │
                                                                       │     Backend     │
                                                                       │                 │
                                                                       └─────────────────┘

Initially, the metrics-generator will be able to derive two different sets of metrics: span-metrics and service graphs.

  • Span-metrics: aggregates request, error and duration (RED) metrics from span data. This processor computes the total count and the duration of spans for every unique combination of dimensions. Dimensions can be the service name, the operation, the span kind, the status code and any tag or attribute present in the span. (A minimal sketch of this idea follows the list.)
  • Service graphs: a service graph is a visual representation of the interrelationships between various services. The service graph processor analyses trace data and generates metrics describing the relationships between services. These metrics can be used by e.g. Grafana to draw a service graph. For a more in-depth look, check here.
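To make the span-metrics idea more concrete, here is a minimal, hypothetical Go sketch of deriving RED-style metrics from spans with prometheus.CounterVec and prometheus.HistogramVec. All type, metric, and method names below are illustrative assumptions, not the actual processor code in this PR:

```go
package spanmetricssketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Span is a simplified stand-in for the span data the processor receives.
type Span struct {
	Service    string
	Name       string
	Kind       string
	StatusCode string
	Duration   time.Duration
}

// Processor holds a call counter and a latency histogram per dimension combination.
type Processor struct {
	callsTotal *prometheus.CounterVec
	latency    *prometheus.HistogramVec
}

func NewProcessor(reg prometheus.Registerer) *Processor {
	labels := []string{"service", "span_name", "span_kind", "span_status"}
	p := &Processor{
		callsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "traces",
			Name:      "spanmetrics_calls_total",
			Help:      "Total number of spans per dimension combination.",
		}, labels),
		latency: prometheus.NewHistogramVec(prometheus.HistogramOpts{
			Namespace: "traces",
			Name:      "spanmetrics_duration_seconds",
			Help:      "Span duration per dimension combination.",
			Buckets:   prometheus.DefBuckets,
		}, labels),
	}
	reg.MustRegister(p.callsTotal, p.latency)
	return p
}

// PushSpans updates the counter and histogram for every received span.
func (p *Processor) PushSpans(spans []Span) {
	for _, s := range spans {
		lv := []string{s.Service, s.Name, s.Kind, s.StatusCode}
		p.callsTotal.WithLabelValues(lv...).Inc()
		p.latency.WithLabelValues(lv...).Observe(s.Duration.Seconds())
	}
}
```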

We encourage everyone to take a look. All feedback is highly appreciated. Feel free to leave it in this PR, or ping us (@kvrhdn and @mapno) on Slack.

Known issues / TODOs

A list of open tasks that we will tackle after this PR is merged and before the next Tempo release. Once this PR is merged, we can create dedicated issues for them.

  • The distributor spawns a new goroutine for every push; we should optimize this, for instance by introducing a queue. We want a way to limit the number of pending requests to the metrics-generator in case it is having issues and requests start backing up (see the sketch after this list).

  • When generating metrics, we should take the timestamp of the received span into account. Spans can take a long time to reach Tempo (requests can be retried multiple times, etc.), and the timestamp could even be in the future. We should either figure out a way to correct metrics that have already been sent or refuse spans that are too old or too new.

  • We should add controls to limit the number of series we generate/remote write.
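For the first TODO above, a minimal, hypothetical Go sketch of a bounded push queue between the distributor and the metrics-generator. The queue rejects pushes when it is full so back-pressure cannot accumulate in the distributor; names and the send callback are assumptions, not the eventual implementation:

```go
package pushqueue

import (
	"context"
	"errors"
)

// ErrQueueFull is returned when too many pushes are already pending.
var ErrQueueFull = errors.New("metrics-generator push queue is full")

// PushRequest is a placeholder for the batch of spans forwarded by the distributor.
type PushRequest struct{ /* spans ... */ }

// Queue bounds the number of in-flight pushes to the metrics-generator
// instead of spawning an unbounded goroutine per push.
type Queue struct {
	reqs chan PushRequest
}

func NewQueue(size, workers int, send func(context.Context, PushRequest) error) *Queue {
	q := &Queue{reqs: make(chan PushRequest, size)}
	for i := 0; i < workers; i++ {
		go func() {
			for r := range q.reqs {
				// A real implementation would count failures in a metric;
				// dropping the error keeps the sketch short.
				_ = send(context.Background(), r)
			}
		}()
	}
	return q
}

// Enqueue fails fast when the queue is full, so a struggling
// metrics-generator cannot back requests up into the distributor.
func (q *Queue) Enqueue(r PushRequest) error {
	select {
	case q.reqs <- r:
		return nil
	default:
		return ErrQueueFull
	}
}
```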

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

yvrhdn and others added 30 commits December 13, 2021 19:29
# Conflicts:
#	pkg/tempopb/tempo.pb.go
Generator is now able to run an embedded OTel collector.
This allows running any number of processors, including spanmetrics.

This collector uses a custom metrics exporter and a remote write client.
Metrics generated by the spanmetrics processor are converted to Prometheus
and appended to the remote write client's WAL, which will export the metric
data after commit.
Changes:
- Remote write was constantly failing with “out of order samples”. Fix: apparently the timestamp has to be in milliseconds, not in seconds. There is not a lot of documentation on the Prometheus interfaces, but the cortex proto has a field timestampMs.
- After this, remote write was failing with label name "le" is not unique: invalid sample. Fix: there was a bug in the append logic of the labels array; because the labels variable was reused in a for-loop, the le label was added multiple times. (Both fixes are sketched after this commit list.)

Initially I thought pushing metrics on every push request was causing the out-of-order issues, so I refactored the processor a bit to have both a PushSpans and a CollectMetrics method. CollectMetrics is called every 15s with a storage.Appender. This way we can also use a single appender for multiple processors.
This wasn’t necessary to get it working, but I think this is the logic we want to end up with (i.e. don’t send samples too often to keep DPM low).
# Conflicts:
#	go.mod
#	go.sum
#	pkg/tempopb/tempo.pb.go
#	pkg/util/test/req.go
#	vendor/modules.txt
Histogram buckets and additional dimensions can be configured
Store
- make sure tests pass
- fix copy-paste error in shouldEvictHead

Service graphs
- ensure metric names align with agent
- server latencies were overwriting client latencies
- expose operational metrics
Changes:

Service graphs processor
- replaced code to manually build metrics with CounterVec and HistogramVec
- aligned metric names with the agent
- commented out code around collectCh, this was leaking memory

Span metrics processor
- replaced code to manually build metrics with CounterVec and HistogramVec

Introduced Registry to bridge prometheus.Registerer and storage.Appender

Add support for configuring external labels

Fix generator_test.go
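The two remote-write fixes described in the commit messages above are easy to illustrate. A hedged Go sketch with assumed names: Prometheus remote-write timestamps are expressed in milliseconds, and a shared labels slice must be copied per histogram bucket or the le label is appended once per loop iteration:

```go
package remotewritefix

import (
	"fmt"
	"time"

	"github.com/prometheus/prometheus/model/labels"
)

// Remote write expects millisecond timestamps; sending seconds makes every
// sample appear far in the past and triggers "out of order samples" errors.
func sampleTimestamp(t time.Time) int64 {
	return t.UnixMilli() // not t.Unix()
}

// appendBucketLabels shows the "le" fix: build a fresh slice per bucket
// instead of appending to the shared base slice inside the loop, which
// would accumulate one "le" label per iteration.
func appendBucketLabels(base labels.Labels, buckets []float64) []labels.Labels {
	out := make([]labels.Labels, 0, len(buckets))
	for _, b := range buckets {
		lbls := make(labels.Labels, 0, len(base)+1)
		lbls = append(lbls, base...)
		lbls = append(lbls, labels.Label{Name: "le", Value: fmt.Sprint(b)})
		out = append(out, lbls)
	}
	return out
}
```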
cmd/tempo/app/app.go
modules/distributor/config.go
modules/distributor/distributor.go
@joe-elliott (Member) left a comment:
Additional thoughts. Things are looking good. I will say that e2e tests would be nice to have, but not necessary for merging this PR.

modules/generator/instance.go
modules/generator/processor/registry.go
modules/generator/remotewrite/appender.go
modules/generator/generator.go
@yvrhdn (Member) commented on Feb 17, 2022:

I think we've addressed all review comments now, another review would be welcome 🙂

Agree that adding an e2e test would be really nice; it's on our todo list. Not sure yet whether we can create one now or whether it will come in a later PR.

@joe-elliott (Member) left a comment:

Let's go!

}

p.spanMetricsCallsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "traces",
Contributor commented:
should these be namespaced "traces" or "tempo"?

Member replied:
We are using the traces namespace here to match with the metrics generated by the Grafana Agent. If we generate the same metrics, it will be easier for people to reuse queries/dashboards for both.
The service graphs view in Grafana for instance expects metrics with the name traces_service_graph_.... We could make this configurable in Grafana, but for now we want to keep it as simple as possible.

Contributor replied:
Hmm. Yeah, I wish we could make this tempo but I don't see a clear path to making that change. I guess it's ok to leave it as is.

func (p *processor) RegisterMetrics(reg prometheus.Registerer) error {
labelNames := []string{"service", "span_name", "span_kind", "span_status"}
if len(p.cfg.Dimensions) > 0 {
labelNames = append(labelNames, p.cfg.Dimensions...)
Contributor commented:
I definitely think we need some limits here to keep a check on the number of active series, if we're going to allow custom dimensions. Either here or in appender.go where we are appending labels/samples. Maybe for a future PR though?

Member replied:

Agree we need to limit the number of series we create because this will impact the memory used by the generators. I wouldn't bother limiting it here (for now) since dimensions is controlled by the operator of Tempo, so it's easy to lower this if the cluster gets into trouble.
At this point I'm more concerned about the series created by labels like span_name because these are dependent on the cardinality of the traces we are ingesting and we can't control that.

Limiting the active series in the appender makes a lot of sense, both to protect Tempo and to protect the downstream TSDB. I'll add it as a TODO to this PR and will make a tracking issue with all remaining tasks once this PR is merged.
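To illustrate the active-series limit discussed here, a minimal, hypothetical Go sketch of a limiter that an appender wrapper could consult before appending a sample. This is not the code added by this PR; all names and the fingerprint representation are assumptions:

```go
package serieslimit

import (
	"errors"
	"sync"
)

// ErrMaxSeriesReached is returned when appending would create a new series past the limit.
var ErrMaxSeriesReached = errors.New("maximum number of active series reached")

// SeriesLimiter tracks the series seen so far and rejects new ones over the limit.
type SeriesLimiter struct {
	mtx       sync.Mutex
	maxSeries int
	active    map[string]struct{}
}

func NewSeriesLimiter(maxSeries int) *SeriesLimiter {
	return &SeriesLimiter{maxSeries: maxSeries, active: map[string]struct{}{}}
}

// Observe registers the series identified by a label-set fingerprint.
// Known series always pass; new series pass only while under the limit,
// protecting both the generator's memory and the downstream TSDB.
func (l *SeriesLimiter) Observe(fingerprint string) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if _, ok := l.active[fingerprint]; ok {
		return nil
	}
	if len(l.active) >= l.maxSeries {
		return ErrMaxSeriesReached
	}
	l.active[fingerprint] = struct{}{}
	return nil
}
```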

# Conflicts:
#	CHANGELOG.md
#	operations/kube-manifests/Deployment-distributor.yaml
@mapno merged commit f4175e5 into main on Feb 21, 2022
@mapno deleted the metrics-generator-dev branch on February 21, 2022 14:34
@mapno mentioned this pull request on Feb 23, 2022 (26 tasks)
@yvrhdn restored the metrics-generator-dev branch on February 25, 2022 17:03
@mapno deleted the metrics-generator-dev branch on September 6, 2023 08:12