Add new histogram to generator - messaging system latency #3453

Merged

Conversation

adirmatzkin
Contributor

@adirmatzkin adirmatzkin commented Mar 3, 2024

What this PR does:
This PR adds a new metric calculated by the metrics-generator service-graph processor.
The metric is a histogram for the latency between 2 services communicating through a messaging system.
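
For illustration, a minimal sketch of the idea: derive the messaging-system latency from the producer and consumer span timestamps and clamp negative differences (e.g. from clock skew) to zero. The timestamps, semantics, and metric name used by the actual processor may differ; this is an assumption, not the merged implementation.

package main

import (
	"fmt"
	"time"
)

// messagingSystemLatencySeconds approximates the time a request spends in the
// messaging system: from the producer span finishing (message published) to
// the consumer span starting (message received). Assumed semantics, for
// illustration only.
func messagingSystemLatencySeconds(producerEndUnixNano, consumerStartUnixNano uint64) float64 {
	if producerEndUnixNano > consumerStartUnixNano {
		// Clock skew can make the consumer appear to start before the producer
		// finished; clamp to 0 rather than record a negative latency.
		return 0
	}
	return float64(consumerStartUnixNano-producerEndUnixNano) / float64(time.Second)
}

func main() {
	published := uint64(time.Now().UnixNano())
	received := published + uint64(150*time.Millisecond)
	fmt.Printf("messaging-system latency: %.3fs\n", messagingSystemLatencySeconds(published, received))
}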

Which issue(s) this PR fixes:
Fixes #3232

[Screenshot 2024-03-03 18:14:48]

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@CLAassistant

CLAassistant commented Mar 3, 2024

CLA assistant check
All committers have signed the CLA.

@adirmatzkin adirmatzkin changed the title Generator messaging system latency Add new histogram to generator - messaging system latency Mar 3, 2024
@joe-elliott
Member

joe-elliott commented Mar 4, 2024

Thanks for putting some time into this. It's a fairly small change and should be a pretty easy review, but we do need this to be configurable.

We are currently trying to keep metrics generator series down and histograms tend to generate a lot of series.

I will give this more time tomorrow.

@adirmatzkin
Contributor Author

> We are currently trying to keep metrics generator series down and histograms tend to generate a lot of series.

Just pushed some changes to allow controlling the metric from config.
Not sure I did it in all the relevant places... I saw it working in a specific config change I tested, though.

Also, the metric is initialized even if the config is set to false... probably not ideal, but it felt simpler given the current structure of the code. Is that fine? Maybe you guys have a different idea?
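
For reference, a rough sketch of what gating the histogram behind a config flag could look like on the Go side, with registration skipped entirely when the flag is off. The type, field, yaml key, and metric name here are illustrative assumptions, not necessarily the exact ones used in this PR.

package servicegraphs

import (
	"github.com/prometheus/client_golang/prometheus"
)

// Config is a simplified, hypothetical slice of the processor config.
type Config struct {
	// When false, the histogram is never registered, so no extra series are produced.
	EnableMessagingSystemLatencyHistogram bool `yaml:"enable_messaging_system_latency_histogram"`
}

// newMessagingLatencyHistogram registers the histogram only when enabled and
// returns nil otherwise, so callers can guard observations with a nil check.
func newMessagingLatencyHistogram(cfg Config, reg prometheus.Registerer) *prometheus.HistogramVec {
	if !cfg.EnableMessagingSystemLatencyHistogram {
		return nil
	}
	h := prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "traces_service_graph_request_messaging_system_seconds",
		Help: "Latency between two services communicating through a messaging system.",
	}, []string{"client", "server"})
	reg.MustRegister(h)
	return h
}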

Member

@joe-elliott joe-elliott left a comment

a few questions, but looking good!

also, we will need a changelog

modules/generator/overrides.go (comment resolved)
@@ -315,8 +329,14 @@ func (p *Processor) spanFailed(span *v1_trace.Span) bool {
return span.GetStatus().GetCode() == v1_trace.Status_STATUS_CODE_ERROR
}

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
// handling potential underflow of uint64 subtraction
diff := int64(unixNanoEnd) - int64(unixNanoStart)
Member

instead of casting we could check to see if diff > end for overflow. is that better?

Member

also, we recently made a change to protect span metrics from negative latencies:

https://github.com/grafana/tempo/pull/3412/files

@mdisibio should we make sure the same protections are here?

Contributor Author

> instead of casting we could check to see if diff > end for overflow. is that better?

I agree we can make it cleaner.
I really liked your idea, very clever 🙃 My only concern is how clear it will be: after making the diff > end check, if underflow did happen, meaning the subtraction yields a negative number, don't we want to return it?

If not, maybe we could make this function a bit more specific and move the negative-values "protection" into it... and then return 0?
It could also be as simple as:

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
	if unixNanoStart > unixNanoEnd {
		// To prevent underflow, return 0.
		return 0
	}
	// Safe subtraction.
	return float64(unixNanoEnd-unixNanoStart) / 1e9
}
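
As a quick standalone sanity check of the behavior discussed here (a clamped, never-negative result), the proposed function can be exercised directly; the timestamps below are made up for illustration:

package main

import "fmt"

func unixNanosDiffSec(unixNanoStart uint64, unixNanoEnd uint64) float64 {
	if unixNanoStart > unixNanoEnd {
		// To prevent underflow, return 0.
		return 0
	}
	// Safe subtraction.
	return float64(unixNanoEnd-unixNanoStart) / 1e9
}

func main() {
	fmt.Println(unixNanosDiffSec(1_000_000_000, 3_500_000_000)) // 2.5 seconds
	fmt.Println(unixNanosDiffSec(3_500_000_000, 1_000_000_000)) // 0: plain uint64 subtraction would underflow here
}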

Member

let's follow suit with the linked PR for consistency and return 0.

i think your suggested unixNanosDiffSec in this PR is perfect.

Contributor Author

so this one is resolved? 🤔

Contributor Author

@adirmatzkin adirmatzkin left a comment

Added to changelog.
Will update docs after making a decision about the "negative diffs"

@joe-elliott
Member

Thanks for the work. This is looking good.

heads up you will need this commit in your branch (merge or rebase) to get CI to pass:

7540d57

@adirmatzkin
Contributor Author

> heads up you will need this commit in your branch (merge or rebase) to get CI to pass:

Synced from upstream + added docs.
@joe-elliott anything else needed? 🙃

Member

@joe-elliott joe-elliott left a comment

Looking good, just have a few small thoughts.

Also, what do you think about adding the ConnectionType to the expired edge metric? If someone wants to do service graph metrics through queues, I'm concerned it will create a lot of expired edges. It might be nice for an operator to know whether their expired edges are largely queues or synchronous calls.
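
To illustrate that suggestion (metric, label, and function names below are hypothetical, not Tempo's actual identifiers), tagging an expired-edge counter with the connection type could look roughly like this:

package servicegraphs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// expiredEdges counts edges that expired before both sides were observed,
// partitioned by how the two services were connected; names are illustrative.
var expiredEdges = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "traces_service_graph_expired_edges_total",
	Help: "Edges that expired before being completed, by connection type.",
}, []string{"connection_type"})

// onEdgeExpired records an expired edge, e.g. connectionType could be
// "messaging_system" for queue-based edges or "" for synchronous calls.
func onEdgeExpired(connectionType string) {
	expiredEdges.WithLabelValues(connectionType).Inc()
}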

adirmatzkin and others added 4 commits on April 25, 2024 at 18:57, including:
…ardinality.md

use present when possible

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
@adirmatzkin adirmatzkin force-pushed the generator-messaging-system-latency branch from 0b29823 to 7121ecb on April 25, 2024 at 16:12
Contributor Author

@adirmatzkin adirmatzkin left a comment

Ready to merge this one? @joe-elliott
Just rebased 🙃

@knylander-grafana
Contributor

Thank you for updating the docs

@adirmatzkin
Contributor Author

@joe-elliott I think we're ready 🥳

@joe-elliott
Member

running CI!

@joe-elliott
Member

Thanks, Adir!

@joe-elliott joe-elliott merged commit 35aa72e into grafana:main May 3, 2024
15 checks passed
joe-elliott pushed a commit to joe-elliott/tempo that referenced this pull request May 7, 2024
* introduce new service-graph metric for messaging-system latency

* added tests for new histogram values

* fix linting

* make new metric optional via config

* fix typo

* fix failing tests

* add feature to changelog

* negative times diff consistency - return 0 instead of negative

* update docs

* Update docs/sources/tempo/metrics-generator/service_graphs/estimate-cardinality.md

use present when possible

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* change 1e9 to time const

* added a reference to the "wait" config of the processor

* fixed indentations and formatting stuff from rebasing

* removed mistaken println found by linter

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Service-graph metrics for messaging_system connection type: Taking into account network+middleware latency