Add new histogram to generator - messaging system latency #3453

Merged

Conversation

adirmatzkin
Contributor

@adirmatzkin adirmatzkin commented Mar 3, 2024

What this PR does:
This PR adds a new metric calculated by the metrics-generator service-graph processor.
The metric is a histogram for the latency between 2 services communicating through a messaging system.
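
For illustration, a minimal sketch of the idea: derive the messaging-system latency from the producer and consumer span timestamps and clamp negative differences (e.g. from clock skew) to zero. The timestamps, semantics, and metric name used by the actual processor may differ; this is an assumption, not the merged implementation.

package main

import (
	"fmt"
	"time"
)

// messagingSystemLatencySeconds approximates the time a request spends in the
// messaging system: from the producer span finishing (message published) to
// the consumer span starting (message received). Assumed semantics, for
// illustration only.
func messagingSystemLatencySeconds(producerEndUnixNano, consumerStartUnixNano uint64) float64 {
	if producerEndUnixNano > consumerStartUnixNano {
		// Clock skew can make the consumer appear to start before the producer
		// finished; clamp to 0 rather than record a negative latency.
		return 0
	}
	return float64(consumerStartUnixNano-producerEndUnixNano) / float64(time.Second)
}

func main() {
	published := uint64(time.Now().UnixNano())
	received := published + uint64(150*time.Millisecond)
	fmt.Printf("messaging-system latency: %.3fs\n", messagingSystemLatencySeconds(published, received))
}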

Which issue(s) this PR fixes:
Fixes #3232

[Screenshot 2024-03-03 18:14:48]

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@CLAassistant

CLAassistant commented Mar 3, 2024

CLA assistant check
All committers have signed the CLA.

@adirmatzkin adirmatzkin changed the title Generator messaging system latency Add new histogram to generator - messaging system latency Mar 3, 2024
@joe-elliott
Member

joe-elliott commented Mar 4, 2024

Thanks for putting some time into this. It's a fairly small change and should be a pretty easy review, but we do need this to be configurable.

We are currently trying to keep metrics generator series down and histograms tend to generate a lot of series.

I will give this more time tomorrow.

@adirmatzkin
Contributor Author

> We are currently trying to keep metrics generator series down and histograms tend to generate a lot of series.

Just pushed some changes to allow controlling the metric from config.
Not sure I did it in all the relevant places... I saw it working in a specific config change I tested, though.

Also, the metric is initialized even if the config is set to false... probably not ideal, but it felt simpler given the current structure of the code. Is that fine? Maybe you guys have a different idea?
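
For reference, a rough sketch of what gating the histogram behind a config flag could look like on the Go side, with registration skipped entirely when the flag is off. The type, field, yaml key, and metric name here are illustrative assumptions, not necessarily the exact ones used in this PR.

package servicegraphs

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Config is a simplified, hypothetical slice of the processor config.
type Config struct {
	// When false, the histogram is never registered, so no extra series are produced.
	EnableMessagingSystemLatencyHistogram bool `yaml:"enable_messaging_system_latency_histogram"`
}

// newMessagingLatencyHistogram registers the histogram only when enabled and
// returns nil otherwise, so callers can guard observations with a nil check.
func newMessagingLatencyHistogram(cfg Config, reg prometheus.Registerer) *prometheus.HistogramVec {
	if !cfg.EnableMessagingSystemLatencyHistogram {
		return nil
	}
	h := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "traces_service_graph_request_messaging_system_seconds",
		Help: "Latency between two services communicating through a messaging system.",
	}, []string{"client", "server"})
	reg.MustRegister(h)
	return h
}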

Member

@joe-elliott joe-elliott left a comment

a few questions, but looking good!

also, we will need a changelog

modules/generator/overrides.go (comment resolved)
@@ -315,8 +329,14 @@ func (p *Processor) spanFailed(span *v1_trace.Span) bool {
return span.GetStatus().GetCode() == v1_trace.Status_STATUS_CODE_ERROR
}

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
// handling potential underflow of uint64 subtraction
diff := int64(unixNanoEnd) - int64(unixNanoStart)
Member

instead of casting we could check to see if diff > end for overflow. is that better?

Member

also, we recently made a change to protect span metrics from negative latencies:

https://github.com/grafana/tempo/pull/3412/files

@mdisibio should we make sure the same protections are here?

Contributor Author

> instead of casting we could check to see if diff > end for overflow. is that better?

I agree we can make it cleaner.
I really liked your idea, very clever 🙃 My only concern is how clear it will be: after making the diff > end check, if underflow did happen, meaning the subtraction yields a negative number, don't we want to return it?

If not, maybe we could make this function a bit more specific and move the negative-values "protection" into it... and then return 0?
It could also be as simple as:

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
	if unixNanoStart > unixNanoEnd {
		// To prevent underflow, return 0.
		return 0
	}
	// Safe subtraction.
	return float64(unixNanoEnd-unixNanoStart) / 1e9
}
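
As a quick standalone sanity check of the behavior discussed here (a clamped, never-negative result), the proposed function can be exercised directly; the timestamps below are made up for illustration:

package main

import "fmt"

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
	if unixNanoStart > unixNanoEnd {
		// To prevent underflow, return 0.
		return 0
	}
	// Safe subtraction.
	return float64(unixNanoEnd-unixNanoStart) / 1e9
}

func main() {
	fmt.Println(unixNanosDiffSec(1_000_000_000, 3_500_000_000)) // 2.5 seconds
	fmt.Println(unixNanosDiffSec(3_500_000_000, 1_000_000_000)) // 0: plain uint64 subtraction would underflow here
}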

Member

let's follow suit with the linked PR for consistency and return 0.

i think your suggested unixNanosDiffSec in this PR is perfect.

Contributor Author

so this one is resolved? 🤔

Contributor Author

@adirmatzkin adirmatzkin left a comment

Added to changelog.
Will update docs after making a decision about the "negative diffs"

@joe-elliott
Member

Thanks for the work. This is looking good.

heads up you will need this commit in your branch (merge or rebase) to get CI to pass:

7540d57

@adirmatzkin
Contributor Author

> heads up you will need this commit in your branch (merge or rebase) to get CI to pass:

Synced from upstream + added docs.
@joe-elliott anything else needed? 🙃

Member

@joe-elliott joe-elliott left a comment

Looking good, just have a few small thoughts.

Also, what do you think about adding the ConnectionType to the expired edge metric? If someone wants to do service graph metrics through queues, I'm concerned it will create a lot of expired edges. It might be nice for an operator to know whether their expired edges are largely queues or synchronous calls.
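
To illustrate that suggestion (metric, label, and function names below are hypothetical, not Tempo's actual identifiers), tagging an expired-edge counter with the connection type could look roughly like this:

package servicegraphs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// expiredEdges counts edges that expired before both sides were observed,
// partitioned by how the two services were connected; names are illustrative.
var expiredEdges = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "traces_service_graph_expired_edges_total",
	Help: "Edges that expired before being completed, by connection type.",
}, []string{"connection_type"})

// onEdgeExpired records an expired edge, e.g. connectionType could be
// "messaging_system" for queue-based edges or "" for synchronous calls.
func onEdgeExpired(connectionType string) {
	expiredEdges.WithLabelValues(connectionType).Inc()
}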

adirmatzkin and others added 4 commits on April 25, 2024 at 18:57, including:
…ardinality.md

use present when possible

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
@adirmatzkin adirmatzkin force-pushed the generator-messaging-system-latency branch from 0b29823 to 7121ecb on April 25, 2024 at 16:12
Contributor Author

@adirmatzkin adirmatzkin left a comment

Ready to merge this one? @joe-elliott
Just rebased 🙃

@knylander-grafana
Contributor

Thank you for updating the docs

@adirmatzkin
Contributor Author

@joe-elliott I think we're ready 🥳

@joe-elliott
Member

running CI!

@joe-elliott
Member

Thanks, Adir!

@joe-elliott joe-elliott merged commit 35aa72e into grafana:main May 3, 2024
15 checks passed
joe-elliott pushed a commit to joe-elliott/tempo that referenced this pull request May 7, 2024
* introduce new service-graph metric for messaging-system latency

* added tests for new histogram values

* fix linting

* make new metric optional via config

* fix typo

* fix failing tests

* add feature to changelog

* negative times diff consistency - return 0 instead of negative

* update docs

* Update docs/sources/tempo/metrics-generator/service_graphs/estimate-cardinality.md

use present when possible

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* change 1e9 to time const

* added a reference to the "wait" config of the processor

* fixed indentations and formatting stuff from rebasing

* removed mistaken println found by linter

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Service-graph metrics for messaging_system connection type: Taking into account network+middleware latency