Have OTel TC review metric SDK and sign-off on stable release #3674

MrAlias · 2023-02-03T16:39:27Z

Blocked by:

Verify compliant metric SDK specification implementation: MeterProvider #3640
Verify compliant metric SDK specification implementation: MeterProvider/MeterProvider Creation #3641
Verify compliant metric SDK specification implementation: MeterProvider/Meter Creation #3642
Verify compliant metric SDK specification implementation: MeterProvider/Shutdown #3643
Verify compliant metric SDK specification implementation: MeterProvider/ForceFlush #3644
Verify compliant metric SDK specification implementation: MeterProvider/View #3645
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation #3646
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation/Drop Aggregation #3647
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation/Default Aggregation #3648
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation/Sum Aggregation #3649
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation/Last Value Aggregation #3650
Verify compliant metric SDK specification implementation: MeterProvider/Aggregation/Histogram Aggregations #3651
Verify compliant metric SDK specification implementation: MeterProvider/Observations inside asynchronous callbacks #3652
Verify compliant metric SDK specification implementation: MeterProvider/Resolving duplicate instrument registration conflicts #3653
Verify compliant metric SDK specification implementation: Instrument name #3654
Verify compliant metric SDK specification implementation: Exemplars #3655
Verify compliant metric SDK specification implementation: Exemplars/ExemplarFilter #3656
Verify compliant metric SDK specification implementation: Exemplars/Built-in ExemplarFilters #3657
Verify compliant metric SDK specification implementation: Exemplars/ExemplarReservoir #3658
Verify compliant metric SDK specification implementation: Exemplars/Exemplar defaults #3659
Verify compliant metric SDK specification implementation: MetricReader #3660
Verify compliant metric SDK specification implementation: MetricReader/MetricReader operations/RegisterProducer(metricProducer) #3661
Verify compliant metric SDK specification implementation: MetricReader/MetricReader operations/Collect #3662
Verify compliant metric SDK specification implementation: MetricReader/MetricReader operations/Shutdown #3663
Verify compliant metric SDK specification implementation: MetricReader/Periodic exporting MetricReader #3664
Verify compliant metric SDK specification implementation: MetricExporter #3665
Verify compliant metric SDK specification implementation: MetricExporter/Push Metric Exporter #3666
Verify compliant metric SDK specification implementation: MetricExporter/Pull Metric Exporter #3667
Verify compliant metric SDK specification implementation: MetricProducer #3668
Verify compliant metric SDK specification implementation: Defaults and configuration #3669
Verify compliant metric SDK specification implementation: Numerical limits handling #3670
Verify compliant metric SDK specification implementation: Compatibility requirements #3671
Verify compliant metric SDK specification implementation: Concurrency requirements #3672

Create an issue in the community repository to request a TC member review the metric SDK (similar to open-telemetry/community#1355) once internal review is done.

MrAlias · 2023-08-31T14:57:18Z

Community issue opened: open-telemetry/community#1663

MrAlias · 2023-09-11T16:12:45Z

open-telemetry/community#1663 (comment)

Speaking for the TC, I have reviewed the v1.17.0/v0.40.0 OTel-Go metrics SDK.

Thank you @MrAlias, for conducting a thorough internal review. Having witnessed the output of that review, I am not surprised to find the OTel-Go metrics SDK in a healthy state of conformity. Here are some areas of feedback for consideration as the user-base grows.

Backwards/Forwards compatibility protection

I really admire the work that has been put into protecting the interfaces from breaking changes. Users are explicitly prevented from providing implementations of interfaces that they are not supposed to implement, which protects the future ability of maintainers to improve the code. This sort of protection goes beyond what OTel requires, but ultimately this rigor will help OTel adoption and prevent the need for future breaking changes.
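One common Go technique for this kind of protection is to give an interface an unexported method. The sketch below is illustrative only: `Reader`, `manualReader`, `sealed`, and `NewManualReader` are hypothetical names, not the SDK's actual declarations.

```go
package main

import "fmt"

// Reader is a hypothetical interface that only this package is meant to
// implement. Code in other packages cannot declare the unexported method,
// so maintainers can add exported methods later without breaking
// third-party implementations.
type Reader interface {
	Collect() []int64
	sealed() // unexported: cannot be declared outside this package
}

// manualReader is the package's own implementation.
type manualReader struct{ data []int64 }

// Collect returns a copy so callers cannot mutate internal state.
func (r *manualReader) Collect() []int64 { return append([]int64(nil), r.data...) }

func (r *manualReader) sealed() {}

// NewManualReader is the supported way for callers to obtain a Reader.
func NewManualReader(data ...int64) Reader { return &manualReader{data: data} }

func main() {
	r := NewManualReader(1, 2, 3)
	fmt.Println(r.Collect())
}
```

A type in another package can still satisfy such an interface by embedding it, so the technique discourages outside implementations rather than making them strictly impossible.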

Straightforward code is easy to follow

I have constructed at least two Go metrics SDK prototypes by now, so I am familiar with the territory. What I admire most about this code is that it remains simple and straightforward, despite there being plenty of opportunity to do complicated things to appease demanding users. It is a good thing for a community-maintained SDK to stay simple and avoid premature optimization, and if there are specific problems with performance, users will bring those to the maintainers and (because of the protections mentioned above) there will be a path forward for future optimizations.

As an example of this statement, I noticed that each aggregator has one method per temporality, named cumulative() and delta(). These methods are nearly identical, and it might be tempting to factor the code in such a way that temporality controls were less transparent -- for example, the logic to aggregate could be the same if there were another piece of code responsible for resetting the aggregator state when temporality==delta. Of course, this is a very high-level claim, and the details in the code really matter. It could be that there are more compact or more performant ways to implement this SDK, but I'm glad to have a simple starting point.
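To make the cumulative()/delta() observation concrete, here is a minimal sketch of a sum aggregator with one collection method per temporality. The type `sum`, its string attribute key, and the method bodies are assumptions for illustration; the SDK's real aggregators are more involved.

```go
package main

import (
	"fmt"
	"sync"
)

// sum is a hypothetical aggregator keyed by a pre-rendered attribute string.
type sum struct {
	mu     sync.Mutex
	values map[string]int64
}

func newSum() *sum { return &sum{values: make(map[string]int64)} }

// measure records one measurement for the given attributes.
func (s *sum) measure(attrs string, v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[attrs] += v
}

// cumulative reports totals since start and keeps the state.
func (s *sum) cumulative() map[string]int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int64, len(s.values))
	for k, v := range s.values {
		out[k] = v
	}
	return out
}

// delta reports totals since the previous collection and resets the state.
func (s *sum) delta() map[string]int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := s.values
	s.values = make(map[string]int64, len(out))
	return out
}

func main() {
	c, d := newSum(), newSum()
	for i := 0; i < 2; i++ {
		c.measure("a", 1)
		d.measure("a", 1)
		fmt.Println("cumulative:", c.cumulative()["a"], "delta:", d.delta()["a"])
	}
}
```

The two methods differ only in whether state is reset, which is the near-duplication the review describes; keeping them separate makes each temporality easy to read on its own.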

Memory limits

As discussed in open-telemetry/opentelemetry-go#3006, the SDK contains some TODOs about what was eventually specified in open-telemetry/opentelemetry-specification#2960. I believe it is OK to call the SDK stable without addressing this issue first (as long as the maintainers agree). On the other hand, I'm familiar with cardinality explosions impacting the OTel Collector already, so this would be a nice safety feature to prioritize.
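As a rough illustration of the safety feature in question, the sketch below caps the number of distinct attribute sets an aggregator will track and folds everything beyond the cap into a single overflow series. The names `limitedSum` and `overflowKey`, the string attribute key, and the exact accounting of the limit are assumptions; the overflow attribute shown reflects one reading of the specification and is not taken from the SDK.

```go
package main

import "fmt"

// overflowKey stands in for the attribute set that receives measurements
// once the limit is reached (assumed name, see the paragraph above).
const overflowKey = "otel.metric.overflow=true"

// limitedSum is a hypothetical sum aggregator with a cardinality limit.
// It is not safe for concurrent use; locking is omitted for brevity.
type limitedSum struct {
	limit  int // maximum number of series, including the overflow series
	values map[string]int64
}

func newLimitedSum(limit int) *limitedSum {
	return &limitedSum{limit: limit, values: make(map[string]int64)}
}

// add records v under attrs, or under the overflow series when attrs is
// new and only the slot reserved for overflow remains.
func (s *limitedSum) add(attrs string, v int64) {
	if _, ok := s.values[attrs]; !ok && len(s.values) >= s.limit-1 {
		attrs = overflowKey
	}
	s.values[attrs] += v
}

func main() {
	s := newLimitedSum(3)
	for _, a := range []string{"user=1", "user=2", "user=3", "user=4", "user=1"} {
		s.add(a, 1)
	}
	fmt.Println(len(s.values), s.values[overflowKey])
}
```

With this shape, a cardinality explosion costs one extra series instead of unbounded memory, at the price of losing attribution for the overflowed measurements.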

Performance concerns

I expect there will be users who are not entirely satisfied by the performance of the SDK, but I don't see it as a problem with the SDK itself. A number of memory-allocation optimizations have been applied already (e.g., support for re-use across collections, use of sync.Pool in various places). My one reservation is about the cost of constructing attribute.Set objects, which actually happens in the API, not in the SDK. One of the downsides of the compatibility work mentioned above is the appearance of functional options in the instrument methods (e.g., AddOption, RecordOption, ObserveOption). While this pattern is ultimately very flexible, these are performance-sensitive calls and I would prefer to see a more direct path into the SDK.
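For readers unfamiliar with the pattern, this is a generic sketch of functional options on a hot-path call. All names here (`addConfig`, `AddOption`, `WithAttributes`, `Counter`) are hypothetical stand-ins written as plain functions; the real OTel-Go option types are declared differently.

```go
package main

import (
	"fmt"
	"strings"
)

// addConfig is what every Add call has to resolve from its options.
type addConfig struct{ attrs []string }

// AddOption is a hypothetical functional option for Add.
type AddOption func(*addConfig)

// WithAttributes returns an option that appends attributes to the config.
func WithAttributes(attrs ...string) AddOption {
	return func(c *addConfig) { c.attrs = append(c.attrs, attrs...) }
}

// Counter is a hypothetical instrument. It is not safe for concurrent use.
type Counter struct{ totals map[string]int64 }

// Add applies every option on each call before recording the value,
// which is the per-call overhead the review is concerned about.
func (c *Counter) Add(v int64, opts ...AddOption) {
	var cfg addConfig
	for _, o := range opts {
		o(&cfg)
	}
	c.totals[strings.Join(cfg.attrs, ",")] += v
}

func main() {
	c := &Counter{totals: make(map[string]int64)}

	// Options can be built once and reused across calls, but the
	// configuration is still resolved inside every Add.
	opts := []AddOption{WithAttributes("method=GET")}
	c.Add(1, opts...)
	c.Add(1, opts...)
	fmt.Println(c.totals["method=GET"])
}
```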

I am hopeful that OTel will continue to evolve in ways that encourage more efficient APIs to exist, which is not to say what's here has to change. With what we have in the Go API today, the user may construct a []AddOption to precompute the input to an Add operation; however, there's no way to mutate and reuse this object for a different attribute set. For users upgrading from a statsd-like API surface (where attributes are provided as a list to every operation), this forces a fairly expensive sequence of allocations into the call path. If the API didn't force this, the SDK could be optimized to avoid allocating an attribute.Set when it already has the same set indexed. This is more-or-less borrowing an idea from the Prometheus client library, which is to say that computing a hash function over attributes can allow searching for an existing attribute set without allocations (i.e., faster, less overhead). All of the options that I see for improving this situation, which include an old idea inside OTel known as "Bound instruments", involve changing the API. When the users speak up, I expect there will be pressure to extend the API with new code paths; none of this will go against the specification, however.
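The hash-lookup idea can be sketched as follows: hash the caller's attribute list in place, look for an existing series under that hash, and only copy the attributes the first time a set is seen. Everything here is an assumption for illustration (`KeyValue`, `hashAttrs`, `counter`, and the requirement that callers pass attributes in a consistent order); it is neither the Prometheus client's nor the SDK's code.

```go
package main

import "fmt"

// KeyValue is a hypothetical string-only attribute.
type KeyValue struct{ Key, Value string }

// FNV-1a 64-bit parameters, written inline so hashing needs no hash object.
const (
	offset64 = 14695981039346656037
	prime64  = 1099511628211
)

func hashString(h uint64, s string) uint64 {
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= prime64
	}
	return h
}

// hashAttrs hashes keys and values with separators between them.
// It assumes attributes arrive in a consistent order.
func hashAttrs(attrs []KeyValue) uint64 {
	h := uint64(offset64)
	for _, kv := range attrs {
		h = hashString(hashString(h, kv.Key), "\x00")
		h = hashString(hashString(h, kv.Value), "\x00")
	}
	return h
}

type series struct {
	attrs []KeyValue
	sum   int64
}

// counter indexes series by hash; each bucket handles collisions.
// It is not safe for concurrent use.
type counter struct{ index map[uint64][]*series }

func newCounter() *counter { return &counter{index: make(map[uint64][]*series)} }

func equal(a, b []KeyValue) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func (c *counter) find(attrs []KeyValue) *series {
	for _, s := range c.index[hashAttrs(attrs)] {
		if equal(s.attrs, attrs) {
			return s
		}
	}
	return nil
}

// Add updates an existing series without copying attrs; the copy is
// made only the first time a given attribute set is seen.
func (c *counter) Add(v int64, attrs ...KeyValue) {
	if s := c.find(attrs); s != nil {
		s.sum += v
		return
	}
	h := hashAttrs(attrs)
	owned := append([]KeyValue(nil), attrs...)
	c.index[h] = append(c.index[h], &series{attrs: owned, sum: v})
}

// Value returns the running sum for attrs, if that set has been seen.
func (c *counter) Value(attrs ...KeyValue) (int64, bool) {
	if s := c.find(attrs); s != nil {
		return s.sum, true
	}
	return 0, false
}

func main() {
	c := newCounter()
	c.Add(1, KeyValue{Key: "method", Value: "GET"})
	c.Add(2, KeyValue{Key: "method", Value: "GET"})
	v, ok := c.Value(KeyValue{Key: "method", Value: "GET"})
	fmt.Println(v, ok)
}
```

The equality check after the hash lookup matters: two different attribute sets can share a hash, so the hash only narrows the search.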

One last area of concern (also on performance) is the use of sync.Mutex in places where writes are infrequent and updates are frequent. I've noticed users (of this SDK) with serious lock contention problems in such situations, and have replaced most sync.Mutex objects with other approaches.

Summary

This SDK passes review. Well done @MrAlias, @MadVikingGod, @pellared and the @open-telemetry/go-approvers! 🎉

MrAlias assigned jmacd Sep 8, 2023

MrAlias closed this as completed Sep 11, 2023
