Introduce key/value map bench #1121

shaun-cox · 2023-06-20T20:44:02Z

Two new benchmarks:

key_value_map for evaluating different implementation choices
span_builder to focus on performance analysis of SpanBuilder

Overview

There are four implementations considered:

EvictedHashMap: currently used by SpanData to carry span attributes
IndexMap: currently used by SpanBuilder to carry span attributes (in an OrderMap)
OneVec: Vec<(Key, Value)> where no hashing or duplicate key detection takes place
TwoVec: a Vec<Key> and a Vec<Value>, linked by shared indices, where no hashing or duplicate key detection takes place
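To make the two vector-based layouts concrete, here is a minimal sketch of what they might look like. The `Key` and `Value` aliases and the `insert`/`get` method names are hypothetical stand-ins chosen for illustration, not the types or API used in the benchmark itself.

```rust
// Hypothetical stand-ins for the crate's Key and Value types.
type Key = &'static str;
type Value = i64;

/// OneVec: a single vector of (key, value) pairs.
#[derive(Default)]
struct OneVec {
    entries: Vec<(Key, Value)>,
}

impl OneVec {
    /// Appends without hashing and without checking for an existing key.
    fn insert(&mut self, key: Key, value: Value) {
        self.entries.push((key, value));
    }

    /// Linear scan from the front; returns the first matching value.
    fn get(&self, key: Key) -> Option<&Value> {
        self.entries.iter().find(|(k, _)| *k == key).map(|(_, v)| v)
    }
}

/// TwoVec: parallel vectors, where keys[i] belongs to values[i].
#[derive(Default)]
struct TwoVec {
    keys: Vec<Key>,
    values: Vec<Value>,
}

impl TwoVec {
    /// Pushes onto both vectors so the shared index stays in sync.
    fn insert(&mut self, key: Key, value: Value) {
        self.keys.push(key);
        self.values.push(value);
    }

    /// Scans only the keys, then uses the matching index to read the value.
    fn get(&self, key: Key) -> Option<&Value> {
        self.keys.iter().position(|k| *k == key).map(|i| &self.values[i])
    }
}
```

Note that in this sketch inserting the same key twice simply stores two entries, which is exactly the "no duplicate key detection" trade-off described above.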

There are two main operations to consider for each implementation:

lookup: find two attributes in "the map"
populate: populate n (2, 8, or 32) attributes in the map

lookup approximates what a sampling decision might do when consulting specific keys present in the SpanBuilder. Its performance looks like the following: (Note: the Vec-based implementations are measured in their worst case, finding the last two attributes in the map.)

OneVec beats IndexMap for 2 and 8 attributes in the map, but loses for 32 attributes.
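One way to see why the vector loses ground as the attribute count grows is to count key comparisons in the worst case. The sketch below is illustrative only: the `scan` and `worst_case_comparisons` helpers are hypothetical, they count comparisons rather than measure time, and they assume `n >= 2`.

```rust
/// Scans `attrs` front to back and returns the value for `key` together
/// with the number of key comparisons performed.
fn scan<'a>(attrs: &'a [(String, i64)], key: &str) -> (Option<&'a i64>, usize) {
    let mut comparisons = 0;
    for (k, v) in attrs {
        comparisons += 1;
        if k.as_str() == key {
            return (Some(v), comparisons);
        }
    }
    (None, comparisons)
}

/// Comparisons needed to find the last two of `n` attributes.
/// Assumes n >= 2 (illustrative helper, not part of the benchmark).
fn worst_case_comparisons(n: usize) -> usize {
    let attrs: Vec<(String, i64)> = (0..n).map(|i| (format!("key{i}"), i as i64)).collect();
    let (_, last) = scan(&attrs, &format!("key{}", n - 1));
    let (_, second_last) = scan(&attrs, &format!("key{}", n - 2));
    last + second_last
}
```

Finding the last two of n attributes costs n + (n - 1) = 2n - 1 comparisons, so 3 for 2 attributes, 15 for 8, and 63 for 32. A hash-based lookup pays a roughly fixed hashing cost instead, which is consistent with the crossover seen between 8 and 32 attributes.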

populate approximates what anyone who wants to create a Span must pay to get attributes into the SpanBuilder. Its performance looks like the following:

As expected, populating hash maps is a lot more expensive than populating vectors, but that's the cost of detecting duplicates.
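The difference in behaviour behind that cost can be sketched as follows. This uses the standard library `HashMap` as a stand-in for the hash-based implementations (an assumption made for brevity; the benchmark itself uses EvictedHashMap and IndexMap), and the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Hash-based populate: every insert hashes the key, and a repeated key
/// replaces the earlier value, so duplicates never accumulate.
fn populate_map(pairs: &[(&'static str, i64)]) -> HashMap<&'static str, i64> {
    let mut map = HashMap::with_capacity(pairs.len());
    for &(k, v) in pairs {
        map.insert(k, v);
    }
    map
}

/// Vec-based populate: a plain push with no hashing and no duplicate check.
fn populate_vec(pairs: &[(&'static str, i64)]) -> Vec<(&'static str, i64)> {
    let mut vec = Vec::with_capacity(pairs.len());
    for &(k, v) in pairs {
        vec.push((k, v));
    }
    vec
}
```

Given the input `[("a", 1), ("b", 2), ("a", 3)]`, the map ends up with two entries (with `"a"` mapped to 3), while the vector keeps all three.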

Learnings

Knowing that we have to both populate and lookup when creating a Span, it's useful to see both together:

Now we see that OneVec beats IndexMap for all cases, and is more than twice as fast as IndexMap for 32 attributes.

So this PR is meant to do three things:

Give us a tool for further study.
Help us come to a decision or better strategy about where to handle duplicate detection of attributes. (Personally, I think the cost should be paid for by the user or in the sdk as an option, but not in the api as is currently done.)
Remind us all that O(1) vs. O(N) applies in the large, not necessarily to "tiny" data sets such as the attribute sets we have here. So I think API design: SpanBuilder::attributes #794 should be revisited in this light. For example, if a span processor or sampling decision wants to look up attributes, it should consider indexing them itself, based on how many attributes are actually present, rather than making everyone who creates a SpanBuilder pay for indexing that will never be used when the resulting Span is non-recording.
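As a rough illustration of the last point, a consumer that needs many lookups could build its own index on demand over attributes stored in a plain vector. This is only a sketch under assumed types; `build_index` is a hypothetical helper, not an existing API.

```rust
use std::collections::HashMap;

/// Builds a key -> position index over attributes stored in a plain slice.
/// Later entries overwrite earlier ones, so a duplicated key maps to its
/// last position.
fn build_index(attrs: &[(String, i64)]) -> HashMap<&str, usize> {
    attrs
        .iter()
        .enumerate()
        .map(|(i, (k, _))| (k.as_str(), i))
        .collect()
}
```

With this shape, only the code path that actually performs repeated lookups pays for hashing, and a span that is never recorded pays nothing beyond the pushes.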

codecov · 2023-06-20T20:58:42Z

Codecov Report

Patch coverage has no change and project coverage change: -0.7 ⚠️

Comparison is base (71a64d7) 50.5% compared to head (c85a2b4) 49.8%.

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #1121     +/-   ##
=======================================
- Coverage   50.5%   49.8%   -0.7%     
=======================================
  Files        168     171      +3     
  Lines      19893   20171    +278     
=======================================
+ Hits       10060   10061      +1     
- Misses      9833   10110    +277

see 4 files with indirect coverage changes


djc · 2023-06-21T07:36:47Z

Awesome work! It makes a lot of sense to me that, even if duplicate detection is necessary, the data sizes are such that a simple sequential span might still be the optimal way to store data.

cijothomas · 2023-06-21T22:19:08Z

Help us come to a decision or better strategy about where to handle duplicate detection of attributes. (Personally, I think the cost should be paid for by the user or in the sdk as an option, but not in the api as is currently done.)

I really like the idea of removing duplicate detection logic from the API/SDK (i.e., switching from HashMaps to the OneVec/TwoVec ideas presented). If there is a strong need to dedup, it can be added as an optional span/log processor or in the OTLPExporter itself, so the main user thread won't pay any cost. The spec actually allows flexibility in terms of where the de-dup should occur.

https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/common#attribute-collections
"The enforcement of uniqueness may be performed in a variety of ways as it best fits the limitations of the particular implementation."

(This topic was discussed in the SIG call on 06/20 as well, but after re-reading the spec I think it totally makes sense for this SIG to remove the costly de-dup.)
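For illustration, a deferred de-dup pass of the kind suggested here (run in a processor or exporter rather than on the user's thread) might look like the sketch below. The `dedup_last_wins` name, the attribute types, and the choice that the last value written wins are all assumptions made for this example.

```rust
use std::collections::HashSet;

/// Removes duplicate keys, keeping the last value written for each key
/// and preserving the relative order of the surviving entries.
fn dedup_last_wins(attrs: Vec<(String, i64)>) -> Vec<(String, i64)> {
    let mut seen = HashSet::new();
    let mut out: Vec<(String, i64)> = Vec::with_capacity(attrs.len());
    // Walk backwards so the last occurrence of each key is seen first.
    for (k, v) in attrs.into_iter().rev() {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(k.clone()) {
            out.push((k, v));
        }
    }
    // Restore the original front-to-back order.
    out.reverse();
    out
}
```

The hashing cost is still paid, but only once per exported span and off the hot path, instead of on every attribute insertion.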

Introduce key/value map bench

c85a2b4

shaun-cox requested a review from a team as a code owner June 20, 2023 20:44

jtescher approved these changes Jun 22, 2023

jtescher merged commit 6682594 into open-telemetry:main Jun 22, 2023
10 of 11 checks passed

shaun-cox deleted the ehm branch June 23, 2023 01:32

lalitb mentioned this pull request Jul 5, 2023

[Logs SDK] Modify LogRecord to use Vector instead of OrderMap for attributes #1142

Merged

4 tasks

cijothomas mentioned this pull request Oct 3, 2023

SpanAttribute key deduplication #1284

Closed

cijothomas mentioned this pull request Oct 11, 2023

SpanAttributes modified to use Vec instead of OrderMap/EvictedHashMap #1293

Merged
