-
Notifications
You must be signed in to change notification settings - Fork 189
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add otel-metrics operator for image-based gadgets #2900
base: main
Are you sure you want to change the base?
Conversation
e2aaf57
to
51c13e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
This is looking great as this is something which would help us with regard to prometheus.
I have nonetheless several comments and I also would like how does it articulate with #2425.
Particularly as this PR adds support for histogram, so I am wondering if I can make use of them to revive the profiler support.
Best regards.
const ( | ||
name = "otel-metrics" | ||
|
||
Priority = 50000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should not have a global Priority we share between package and we just increase this value each time for new operator (except those which need to have a very low priority).
Because with all of this hardcoded values, we are taking the highway to future problems and this would not be easy to detect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely agree.
go func() { | ||
mux := http.NewServeMux() | ||
mux.Handle("/metrics", promhttp.Handler()) | ||
err := http.ListenAndServe(globalParams.Get(ParamOtelMetricsListenAddress).AsString(), mux) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this use https?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's using http, which is the default for serving prometheus metrics "internally". We should add an option to be able to use https later on, but that requires certificates to be present and we should plan a proper solution that then could also secure the gRPC endpoints - this is especially tricky on Kubernetes (cert-manager, etc.).
uses a counter, gauge and a histogram that emits new values every second. With the provided | ||
annotations, the operator will set those fields up properly and export them. | ||
|
||
Note that in this demo no actual gadget is run (thus the image parameter is unused) - instead, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose people would use it with an image?
So, is there any possibilities to craft a toy image to display this?
If this is not that simple, let's stick with what we have right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most likely - we just don't have any right now, so I kept this as simple as possible to get things started. I'd like to replace it later on once we have a proper example.
51c13e0
to
5a1da64
Compare
Thanks for the feedback!
A follow-up PR will bring a generic solution that should also make it possible to implement profiler gadgets. Later on we could add support for printing histograms to the CLI (or another dedicated) operator (but I haven't started working on that one, yet). |
5a1da64
to
35bfa23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
I only found nits but I would like to get a better understanding of it.
Also another review from someone else more familiar with open telemetry would be welcomed.
Best regards.
continue | ||
} | ||
|
||
metricsName := gadgetCtx.ImageName() // TODO: make this more unique? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the risk here? That two metrics share the same name?
If so, just happen some random number or have a static int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually if you're exporting/collecting metrics, you want to have the name stable in order to have a continuous series of data. If you're running multiple gadgets of the same type with different configs under the same name (same as not configured in the current implementation), that will most likely skew the data. The uniqueness could be achieved by appending for example part of the hash of the configuration. Just counting up wouldn't be stable, especially in the kubernetes environment (each node could have different counters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uniqueness could be achieved by appending for example part of the hash of the configuration.
OK, but shouldn't we rather explode if the different configs have the same name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be hard to keep track of the names, especially again in a kubernetes environment. But we could make the name mandatory, so that it can't happen by accident (or go with the hash route).
I should probably also add a parameter so that one can override the name per gadget run and not have it tied only to the gadget.yaml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be hard to keep track of the names, especially again in a kubernetes environment. But we could make the name mandatory, so that it can't happen by accident (or go with the hash route).
We can go with adding the hash, but we should print a warning to inform users we change the name to make it unique, so no surprises on their side.
I should probably also add a parameter so that one can override the name per gadget run and not have it tied only to the
gadget.yaml
.
This seems a good idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the behavior slightly. Now the annotation metrics.export
(previously metrics.enable
) tells the exporter to consider a datasource for exporting. Setting the name using the parameter is now mandatory. This helps preventing accidental duplicates IMHO.
382f825
to
0b574d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's looking very good. I'm looking forward to see how it can be used from the gadgets.
ctx, cancel := context.WithTimeout(context.TODO(), time.Second) | ||
defer cancel() | ||
|
||
expectedBuckets := []uint64{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 2, 3, 0} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use AnnotationMetricsBoundaries to also test it?
db181ef
to
74cf555
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi!
I have some comments.
Best regards.
switch f.Type() { | ||
default: | ||
// This should not be called with types other than int | ||
panic("unsupported field type for asInt64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have this function return an error instead.
Moreover, you would then be able to return the underlying error of calling extract()
some lines above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a code path that shouldn't be used (validation should happen (and happens) in the callers of this function) - that's why I opted not to have runtime error handling here. Should this ever be used, it will let us know immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a code path that shouldn't be used
If this is not used, then simply remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to have these "asserts" in (that would crash and instantly notify us), rather than just returning with an unexpected 0 in those cases (should anyone ever use it in a wrong way.)
switch f.Type() { | ||
default: | ||
// This should not be called with types other than float | ||
panic("unsupported field type for asFloat4") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment.
74cf555
to
76340db
Compare
This adds a new operator that - based on data source and field annotations - can export fields to an Open Telemetry metrics interface (Prometheus). Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
76340db
to
4960c8b
Compare
This introduces a new otel-metrics operator for image-based gadgets. This operator can consume and export data from data sources / fields if they're correctly annotated:
To actually enable metrics export, the
otel-metrics-name
param must be set to the exported name explicitly. This ensures a stable export name. Open Telemetry suggests:I think its good to leave the decision up to the user, since it's effectively them creating the scope by using a specifically configured gadget.
If enabled, the operator will listen on an HTTP port and provide the prometheus-like interface.
A demo to use it from the Go API has been included.
With the recent update to the otel packages, the workarounds for synchronous gauges could finally be removed and the code further simplified.
Functionality that will automatically annotate/export eBPF maps will follow in a separate PR.
Todo