Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce key/value map bench #1121

Merged
merged 1 commit into from
Jun 22, 2023
Merged

Introduce key/value map bench #1121

merged 1 commit into from
Jun 22, 2023

Conversation

shaun-cox
Copy link
Contributor

@shaun-cox shaun-cox commented Jun 20, 2023

Two new benchmarks:

  • key_value_map for evaluating different implementation choices
  • span_builder to focus on performance analysis of SpanBuilder

Overview

There are four implementations considered:

  • EvictedHashMap: currently used by SpanData to carry span attributes
  • IndexMap: currently used by SpanBuilder to carry span attributes (in an OrderMap)
  • OneVec: Vec<(Key, Value>) where no hashing or duplicate key detection takes place
  • TwoVec: a Vec<Key> and a Vec<Value>, linked by shared indices, where no hashing or duplicate key detection takes place

There are two main operations to consider for each implementation:

  • lookup: find two attributes in "the map"
  • populate: populate n (2, 8, or 32) attributes in the map

lookup approximates what a sampling decision might do in consulting specific keys present in the SpanBuilder. Its performance looks like the following: (Note, the Vec based implementations are pessimized for worst-case, finding the last two attributes in the map.)

violin

OneVec beats IndexMap for 2 and 8 attributes in the map, but loses for 32 attributes.

populate approximates what anyone who wants to create a Span must pay to get attributes into the SpanBuilder. Its performance looks like the following:

violin

As expected, populating hash maps is a lot more expensive than populating vectors, but that's the cost of detecting duplicates.

Learnings

Knowing that we have to both populate and lookup when creating a Span, it's useful to see both together:

violin

Now we see that OneVec beats IndexMap for all cases, and is more than twice as fast as IndexMap for 32 attributes.

So this PR is meant to do three things:

  1. Give us a tool for further study.
  2. Help us come to a decision or better strategy about where to handle duplicate detection of attributes. (Personally, I think the cost should be paid for by the user or in the sdk as an option, but not in the api as is currently done.)
  3. Remind us all that O(1) vs. O(N) applies in the large, not necessarily for "tiny" data sets as we have here with attribute sets. So I think API design: SpanBuilder::attributes #794 should be revisited in this light. e.g. if a span processor or sampling decision wants to lookup attributes, it should be considering indexing to make those lookups faster in light of how many attributes are actually present and not necessarily make everyone who creates a SpanBuilder that ends up producing a Span that is non-recording pay for indexing that will never get used.

@shaun-cox shaun-cox requested a review from a team as a code owner June 20, 2023 20:44
@codecov
Copy link

codecov bot commented Jun 20, 2023

Codecov Report

Patch coverage has no change and project coverage change: -0.7 ⚠️

Comparison is base (71a64d7) 50.5% compared to head (c85a2b4) 49.8%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1121     +/-   ##
=======================================
- Coverage   50.5%   49.8%   -0.7%     
=======================================
  Files        168     171      +3     
  Lines      19893   20171    +278     
=======================================
+ Hits       10060   10061      +1     
- Misses      9833   10110    +277     

see 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@djc
Copy link
Contributor

djc commented Jun 21, 2023

Awesome work! It makes a lot of sense to me that, even if duplicate detection is necessary, the data sizes are such that a simple sequential span might still be the optimal way to store data.

@cijothomas
Copy link
Member

Help us come to a decision or better strategy about where to handle duplicate detection of attributes. (Personally, I think the cost should be paid for by the user or in the sdk as an option, but not in the api as is currently done.)

I really like idea of removing duplication detection logic (i.e switch from HashMaps to OneVec/TwoVec ideas presented) from the API/SDK. If there is strong need to dedup, then it can be added as an optional span/log processor OR in the OTLPExporter itself, so the main user thread won't pay any cost. The spec actually allows flexibility in terms on where the de-dup should occur.

https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/common#attribute-collections
"The enforcement of uniqueness may be performed in a variety of ways as it best fits the limitations of the particular implementation."

(This topic was discussed in the SIG call on 06/20 as well, but after re-reading the spec I think it totally makes sense for this SIG to remove the costly de-dup.)

@jtescher jtescher merged commit 6682594 into open-telemetry:main Jun 22, 2023
11 of 12 checks passed
@shaun-cox shaun-cox deleted the ehm branch June 23, 2023 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants