Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add otel-metrics operator for image-based gadgets #2900

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

flyth
Copy link
Member

@flyth flyth commented May 27, 2024

This introduces a new otel-metrics operator for image-based gadgets. This operator can consume and export data from data sources / fields if they're correctly annotated:

  • DataSource needs "metrics.export" set to "true" to be considered by the operator
  • Fields need "metrics.type" set to either "key", "counter", "gauge" or "histogram"
    • counter/gauge/histogram will export those fields using the specified type
    • "key" can be used optionally (and multiple times) to group metrics by the unique set of keys given
  • optionally, "metrics.description" and "metrics.unit" can be used to provide some metadata to the exporter
  • fields without a "metrics.type" will be ignored

To actually enable metrics export, the otel-metrics-name param must be set to the exported name explicitly. This ensures a stable export name. Open Telemetry suggests:

This name uniquely identifies the [instrumentation scope](https://opentelemetry.io/docs/specs/otel/glossary/#instrumentation-scope), such as the [instrumentation library](https://opentelemetry.io/docs/specs/otel/glossary/#instrumentation-library) (e.g. io.opentelemetry.contrib.mongodb), package, module or class name. If an application or library has built-in OpenTelemetry instrumentation, both [Instrumented library](https://opentelemetry.io/docs/specs/otel/glossary/#instrumented-library) and [Instrumentation library](https://opentelemetry.io/docs/specs/otel/glossary/#instrumentation-library) can refer to the same library. In that scenario, the name denotes a module name or component name within that library or application.

I think its good to leave the decision up to the user, since it's effectively them creating the scope by using a specifically configured gadget.

If enabled, the operator will listen on an HTTP port and provide the prometheus-like interface.

A demo to use it from the Go API has been included.

With the recent update to the otel packages, the workarounds for synchronous gauges could finally be removed and the code further simplified.

Functionality that will automatically annotate/export eBPF maps will follow in a separate PR.

Todo

@flyth flyth force-pushed the michael/oci/otel-metrics branch 2 times, most recently from e2aaf57 to 51c13e0 Compare May 27, 2024 17:29
@flyth flyth marked this pull request as ready for review May 28, 2024 07:48
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

This is looking great as this is something which would help us with regard to prometheus.
I have nonetheless several comments and I also would like how does it articulate with #2425.
Particularly as this PR adds support for histogram, so I am wondering if I can make use of them to revive the profiler support.

Best regards.

.github/workflows/inspektor-gadget.yml Outdated Show resolved Hide resolved
const (
name = "otel-metrics"

Priority = 50000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should not have a global Priority we share between package and we just increase this value each time for new operator (except those which need to have a very low priority).
Because with all of this hardcoded values, we are taking the highway to future problems and this would not be easy to detect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely agree.

pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
go func() {
mux := http.NewServeMux()
mux.Handle("/metrics", promhttp.Handler())
err := http.ListenAndServe(globalParams.Get(ParamOtelMetricsListenAddress).AsString(), mux)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this use https?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's using http, which is the default for serving prometheus metrics "internally". We should add an option to be able to use https later on, but that requires certificates to be present and we should plan a proper solution that then could also secure the gRPC endpoints - this is especially tricky on Kubernetes (cert-manager, etc.).

pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
uses a counter, gauge and a histogram that emits new values every second. With the provided
annotations, the operator will set those fields up properly and export them.

Note that in this demo no actual gadget is run (thus the image parameter is unused) - instead,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose people would use it with an image?
So, is there any possibilities to craft a toy image to display this?
If this is not that simple, let's stick with what we have right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely - we just don't have any right now, so I kept this as simple as possible to get things started. I'd like to replace it later on once we have a proper example.

@flyth flyth force-pushed the michael/oci/otel-metrics branch from 51c13e0 to 5a1da64 Compare May 28, 2024 18:10
@flyth
Copy link
Member Author

flyth commented May 28, 2024

Thanks for the feedback!

I have nonetheless several comments and I also would like how does it articulate with #2425.
Particularly as this PR adds support for histogram, so I am wondering if I can make use of them to revive the profiler support.

A follow-up PR will bring a generic solution that should also make it possible to implement profiler gadgets. Later on we could add support for printing histograms to the CLI (or another dedicated) operator (but I haven't started working on that one, yet).

@flyth flyth force-pushed the michael/oci/otel-metrics branch from 5a1da64 to 35bfa23 Compare May 30, 2024 06:46
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I only found nits but I would like to get a better understanding of it.
Also another review from someone else more familiar with open telemetry would be welcomed.

Best regards.

pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
continue
}

metricsName := gadgetCtx.ImageName() // TODO: make this more unique?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the risk here? That two metrics share the same name?
If so, just happen some random number or have a static int.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually if you're exporting/collecting metrics, you want to have the name stable in order to have a continuous series of data. If you're running multiple gadgets of the same type with different configs under the same name (same as not configured in the current implementation), that will most likely skew the data. The uniqueness could be achieved by appending for example part of the hash of the configuration. Just counting up wouldn't be stable, especially in the kubernetes environment (each node could have different counters).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uniqueness could be achieved by appending for example part of the hash of the configuration.

OK, but shouldn't we rather explode if the different configs have the same name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be hard to keep track of the names, especially again in a kubernetes environment. But we could make the name mandatory, so that it can't happen by accident (or go with the hash route).

I should probably also add a parameter so that one can override the name per gadget run and not have it tied only to the gadget.yaml.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be hard to keep track of the names, especially again in a kubernetes environment. But we could make the name mandatory, so that it can't happen by accident (or go with the hash route).

We can go with adding the hash, but we should print a warning to inform users we change the name to make it unique, so no surprises on their side.

I should probably also add a parameter so that one can override the name per gadget run and not have it tied only to the gadget.yaml.

This seems a good idea!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the behavior slightly. Now the annotation metrics.export (previously metrics.enable) tells the exporter to consider a datasource for exporting. Setting the name using the parameter is now mandatory. This helps preventing accidental duplicates IMHO.

@flyth flyth force-pushed the michael/oci/otel-metrics branch 3 times, most recently from 382f825 to 0b574d7 Compare May 30, 2024 17:03
Copy link
Member

@mauriciovasquezbernal mauriciovasquezbernal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's looking very good. I'm looking forward to see how it can be used from the gadgets.

examples/gadgets/operators/otel_metrics/README.md Outdated Show resolved Hide resolved
examples/gadgets/operators/otel_metrics/.gitignore Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics_test.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics_test.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics_test.go Outdated Show resolved Hide resolved
ctx, cancel := context.WithTimeout(context.TODO(), time.Second)
defer cancel()

expectedBuckets := []uint64{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 2, 3, 0}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use AnnotationMetricsBoundaries to also test it?

@flyth flyth force-pushed the michael/oci/otel-metrics branch 3 times, most recently from db181ef to 74cf555 Compare June 21, 2024 09:38
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I have some comments.

Best regards.

switch f.Type() {
default:
// This should not be called with types other than int
panic("unsupported field type for asInt64")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have this function return an error instead.
Moreover, you would then be able to return the underlying error of calling extract() some lines above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a code path that shouldn't be used (validation should happen (and happens) in the callers of this function) - that's why I opted not to have runtime error handling here. Should this ever be used, it will let us know immediately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a code path that shouldn't be used

If this is not used, then simply remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to have these "asserts" in (that would crash and instantly notify us), rather than just returning with an unexpected 0 in those cases (should anyone ever use it in a wrong way.)

switch f.Type() {
default:
// This should not be called with types other than float
panic("unsupported field type for asFloat4")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment.

pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
pkg/operators/otel-metrics/otel-metrics.go Outdated Show resolved Hide resolved
This adds a new operator that - based on data source and field annotations - can export
fields to an Open Telemetry metrics interface (Prometheus).

Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants