[vParquet3] new block encoding with support for dedicated columns #2649
Conversation
Will give this a proper review in time. Heads up on additional parquet changes here: https://github.com/grafana/tempo/pull/2660/files#
Partial review. About 1/3 of the way through by file count.
I was thinking about implementing the following dedicated columns:

```go
backend.DedicatedColumns{
	{Scope: "span", Name: "net.host.name", Type: "string"},
	{Scope: "span", Name: "net.host.addr", Type: "string"},
	{Scope: "resource", Name: "dc.region", Type: "string"},
}
```

These would be marshaled to the following JSON and vice versa:

```json
[{"name": "net.host.name"}, {"name": "net.host.addr"}, {"scope": "resource", "name": "dc.region"}]
```

This would also reduce the size of …
Agreed that a custom JSON marshaler to reduce size is wise, especially since this is being encoded in a URL. We could even go a step further and reduce the field names to single chars like "n" instead of "name". I'm fine with doing this in a later PR after we've done some benchmarking at scale.
Overall excellent work. A very well thought out addition. Some comments and questions I'd like to work through before merging. Nothing major.
One thing I would like to see before merging is the addition of dedicated columns to our search integration tests in /tempodb/tempodb_search_test.go. You should be able to just add some relevant columns to the config there. Let's add both a column that's in the test data and one that's not.
https://github.com/grafana/tempo/blob/main/tempodb/tempodb_search_test.go#L678
...db/encoding/vparquet3/test-data/single-tenant/b27b0e53-66a0-4505-afd6-434ae3cd4a10/meta.json
```go
// dedicatedColumnsToColumnMapping returns mapping from attribute names to spare columns for a given
// block meta and scope.
func dedicatedColumnsToColumnMapping(dedicatedColumns backend.DedicatedColumns, scopes ...backend.DedicatedColumnScope) dedicatedColumnMapping {
```
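A minimal sketch of how such a mapping could work, assuming spare columns are assigned strictly by position in the config. The types and helper here are illustrative, not Tempo's actual implementation; the point is that the assignment depends entirely on configuration order, which is what the review thread below is about.

```go
package main

import "fmt"

// DedicatedColumn is an illustrative stand-in for the configured column.
type DedicatedColumn struct {
	Scope string
	Name  string
}

type columnAssignment struct {
	Attribute   string
	ColumnIndex int // position of the spare column in the schema
}

// dedicatedColumnsToColumnMapping assigns each configured column (filtered
// by scope) the next spare parquet column, in configuration order.
func dedicatedColumnsToColumnMapping(cols []DedicatedColumn, scopes ...string) []columnAssignment {
	want := map[string]bool{}
	for _, s := range scopes {
		want[s] = true
	}
	var out []columnAssignment
	for _, c := range cols {
		if !want[c.Scope] {
			continue
		}
		// Reordering the config would silently remap attributes to
		// different columns -- hence the ordering concern below.
		out = append(out, columnAssignment{Attribute: c.Name, ColumnIndex: len(out)})
	}
	return out
}

func main() {
	cols := []DedicatedColumn{
		{Scope: "span", Name: "net.host.name"},
		{Scope: "resource", Name: "dc.region"},
		{Scope: "span", Name: "net.host.addr"},
	}
	fmt.Println(dedicatedColumnsToColumnMapping(cols, "span"))
	// → [{net.host.name 0} {net.host.addr 1}]
}
```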
the order of dedicated columns is extremely important here. is ordering guaranteed in the various formats we are serializing to/from? json/yaml/proto?
having been through the entire PR at this point, i think this is my biggest concern. We pass the dedicated columns and various representations of them through all manner of in memory and serialized representations. if the order ever changes then we are reading the columns incorrectly.
i think some integration level testing would help with these fears.
AFAIK arrays are ordered per standard definition in JSON, YAML, and protocol buffers. Therefore, everything should work as long as the de-/serialization is compliant with the respective standard.
It's still a good idea to add an integration test 👍
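A quick way to convince ourselves of the JSON part of that claim is a round-trip through `encoding/json`; arrays come back in the order they were written. This is just an illustrative check, not code from the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// roundTrip serializes a slice to JSON and back; a compliant codec must
// preserve array element order, which is what column positions rely on.
func roundTrip(cols []string) []string {
	b, err := json.Marshal(cols)
	if err != nil {
		panic(err)
	}
	var out []string
	if err := json.Unmarshal(b, &out); err != nil {
		panic(err)
	}
	return out
}

func main() {
	fmt.Println(roundTrip([]string{"net.host.name", "net.host.addr", "dc.region"}))
	// → [net.host.name net.host.addr dc.region]
}
```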
You can probably just add dedicated columns configuration to our existing tests under /integration/e2e and that would be sufficient.
I created a new test integration/e2e/encoding_test.go and added dedicated attribute columns to TestSearchCompleteBlock.
```go
func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
```
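The excerpt cuts off after the empty-name check. A hedged completion of what such a `Validate` might look like: the accepted scope and type values here are assumptions based on the configuration discussed in this PR, not the actual implementation.

```go
package main

import "fmt"

// DedicatedColumn is an illustrative stand-in for the configured column.
type DedicatedColumn struct {
	Scope string
	Name  string
	Type  string
}

// Validate sketches the checks discussed in this thread; the valid scope
// set ("span", "resource") and type set ("string") are assumptions.
func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
		return fmt.Errorf("dedicated column invalid: 'name' must not be empty")
	}
	switch dc.Scope {
	case "span", "resource":
	default:
		return fmt.Errorf("dedicated column %q invalid: unknown scope %q", dc.Name, dc.Scope)
	}
	if dc.Type != "string" { // assuming only string columns are supported
		return fmt.Errorf("dedicated column %q invalid: unknown type %q", dc.Name, dc.Type)
	}
	return nil
}

func main() {
	dc := DedicatedColumn{Scope: "span", Name: "net.host.name", Type: "string"}
	fmt.Println(dc.Validate())
	// → <nil>
}
```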
What is the behavior if a user configures an existing well-known column such as service.name at the resource level? Should we fail validation in this case? Warn the user?
For reading and writing, the well-known attributes (i.e. service.name or http.url) are always handled first. Configuring a dedicated column for a well-known attribute has no effect, except that a spare column is wasted and will remain empty.

It would be good to validate this. However, to do this properly we would need to make well-known attributes available in tempodb/backend. For that, we would have to either import something from encoding/vparquet3 directly or extend VersionedEncoding. Given the small impact of well-known attributes in the dedicated column configuration, I think it's OK not to validate this.
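If warning rather than failing were ever wanted, it could be sketched like this. Everything here is hypothetical: the well-known attribute lists are illustrative stand-ins, since the real ones live in encoding/vparquet3 and are not reachable from tempodb/backend, which is exactly the constraint described above.

```go
package main

import "fmt"

// Illustrative well-known attribute sets; NOT the real lists, which live
// in Tempo's encoding/vparquet3 package.
var wellKnownSpanAttrs = map[string]bool{
	"http.url": true,
}

var wellKnownResourceAttrs = map[string]bool{
	"service.name": true,
}

// DedicatedColumn is an illustrative stand-in for the configured column.
type DedicatedColumn struct {
	Scope string
	Name  string
}

// warnWellKnown reports columns that duplicate a well-known attribute and
// would therefore waste a spare column that stays empty.
func warnWellKnown(dc DedicatedColumn) (string, bool) {
	known := wellKnownSpanAttrs
	if dc.Scope == "resource" {
		known = wellKnownResourceAttrs
	}
	if known[dc.Name] {
		return fmt.Sprintf("dedicated column %q duplicates a well-known %s attribute and will remain empty", dc.Name, dc.Scope), true
	}
	return "", false
}

func main() {
	if msg, ok := warnWellKnown(DedicatedColumn{Scope: "resource", Name: "service.name"}); ok {
		fmt.Println(msg)
	}
}
```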
```go
urlParamVersion          = "version"
urlParamSize             = "size"
urlParamFooterSize       = "footerSize"
urlParamDedicatedColumns = "dc"
```
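To illustrate how a "dc" parameter like this might be assembled, here is a hedged sketch. Only the parameter names mirror the constants above; buildQuery, the struct, and the JSON shape are assumptions for illustration, not Tempo's actual request builder.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// DedicatedColumn is an illustrative stand-in for the configured column.
type DedicatedColumn struct {
	Scope string `json:"scope,omitempty"`
	Name  string `json:"name"`
}

// buildQuery encodes the block version and the dedicated column config
// into URL query parameters, skipping "dc" entirely when there are no
// columns (so requests without dedicated columns stay unchanged).
func buildQuery(version string, cols []DedicatedColumn) (string, error) {
	v := url.Values{}
	v.Set("version", version)
	if len(cols) > 0 {
		b, err := json.Marshal(cols)
		if err != nil {
			return "", err
		}
		v.Set("dc", string(b))
	}
	return v.Encode(), nil
}

func main() {
	q, _ := buildQuery("vParquet3", []DedicatedColumn{{Name: "net.host.name"}})
	fmt.Println(q)
}
```

This also makes the size concern above concrete: the whole JSON payload rides in a single percent-encoded query parameter, so shorter field names pay off directly.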
I understand why this is needed, but I think exposing the parquet config here is indicative of a leaky abstraction. Not blocking, but tempted to just send the BlockMeta instead of all the individual params.
I agree. It would be nice to refactor the way the search block request is built, but I think it's better to do this outside of a larger feature PR.
If we go for the single char field names, it would be better to make that change in this PR; otherwise we could run into compatibility problems between …
* Re-order schema to keep columns affected by column index changes low
* Add spare columns for dedicated attributes to schema struct
* Add dedicated column config to block meta
* Read and write attributes in dedicated columns
* Make order of dedicated attributes predictable when reading
* Fix existing tests and benchmark
* Run existing benchmarks and tests with dedicated columns

Co-authored-by: Mario <mariorvinas@gmail.com>

* Add dedicated columns to overrides and blocks
* Improvements
* Change test
* Fix tests
* Extend ingester_test
* Add dedicated columns config to storage block
* Review comments
* Add comment

* Refactor and rename function blockMetaToDedicatedColumnMapping
* Query dedicated attribute columns with TraceQL
* Search tag values in dedicated attribute columns
* Search tags in dedicated attribute columns
* Search for values in dedicated attribute columns in tests
* More consistent naming
* Update block and meta.json in vparquet2/test-data
* Test dedicated column in traceToParquet test
* Format Go code
* Introduce types for dedicated column type and scope (replace StaticTypeFromString() with DedicatedColumnType.ToStaticType())
* The function dedicatedColumnsToColumnMapping() can receive multiple scopes

* Re-order schema to keep columns affected by column index changes low
* Add spare columns for dedicated attributes to schema struct
* Add dedicated column config to block meta
* Read and write attributes in dedicated columns
* Make order of dedicated attributes predictable when reading
* Fix existing tests and benchmark
* Run existing benchmarks and tests with dedicated columns
* Add dedicated columns to overrides and blocks
* Support dedicated columns in compactor block selection
* Changes to hash
* More tests

Co-authored-by: A. Stoewer <adrian@stoewer.me>

* Add dedicated columns to SearchBlockRequest message
* Assign SearchBlockRequest dedicated cols from BlockMeta and vice versa
* Encode SearchBlockRequest to http request and vice versa
* Don't add empty dedicated columns when building a search request
* Unit tests with dedicated columns
* Implement dedicated column scope and type as protobuf enums

* Add validate function
* Refactor: use DedicatedColumns type instead of []DedicatedColumn
* Initialize logger before verifying the config (this fixes the config verification output)
* Check for invalid dedicated columns with '-config.verify true'
* Use ToTempopb() to validate dedicated column scope and type

* Remove TODO comment about caching the dedicated column hash
* Shorten url param for dedicated columns to 'dc'
* Add function to get latest encoding and use it in tests
* Fix name DedicateColumnsFromTempopb

* Remove 'Test' columns from vParquet3 schema
* Rename async iterator environment variable
* Do not export methods of dedicatedColumnMapping
* Skip dedicated attribute lookup depending on scope in searchTagValues
* Validate maximum number of configured dedicated columns
* Test data for vparquet3 uses dedicated columns
* Reduce size of block meta JSON
* Use 'parquet_' prefix for dedicated column configuration

* Add e2e tests for encodings and dedicated attribute columns
* Use dedicated attribute columns in TestSearchCompleteBlock
here we go!
* vParquet3 docs
* Apply suggestions from code review (Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>)
* Address comments
* Apply suggestions from code review (Co-authored-by: A. Stoewer <adrian@stoewer.me>)
* Apply suggestions from code review (Co-authored-by: A. Stoewer <adrian@stoewer.me>)
* Add tempo-cli example
* Update config param name (Ref: #2649 (comment))

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Co-authored-by: A. Stoewer <adrian@stoewer.me>
What this PR does:
This adds a new encoding vParquet3 to Tempo (the default encoding remains unchanged). The new encoding supports dedicated attribute columns as outlined by the design proposal.

Two aspects are missing from this PR: the ability to enable vParquet3 and to configure dedicated attribute columns. Those will be subject to separate PRs.
Which issue(s) this PR fixes:
Contributes to #2527
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]