
[vParquet3] new block encoding with support for dedicated columns #2649

Merged
merged 13 commits into main from vparquet3-release on Jul 26, 2023

Conversation

@stoewer (Collaborator) commented Jul 13, 2023

What this PR does:
This adds a new encoding vParquet3 to Tempo (the default encoding remains unchanged). The new encoding supports dedicated attribute columns as outlined by the design proposal.

Two aspects are missing from this PR:

  • A CLI to analyze blocks in order to find good candidates for dedicated attributes
  • Documentation about how to enable vParquet3 and configure dedicated attribute columns

Those will be addressed in separate PRs.

Which issue(s) this PR fixes:
Contributes to #2527

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@stoewer force-pushed the vparquet3-release branch 6 times, most recently from 9b685a5 to 2c49268, on July 14, 2023 04:35
@stoewer changed the title from "[vParquet3] release new encoding with support for dedicated columns" to "[vParquet3] new block encoding with support for dedicated columns" on Jul 14, 2023
@stoewer marked this pull request as ready for review on July 14, 2023 06:10
@joe-elliott (Member) commented

Will give this a proper review in time. Heads up on additional parquet changes here: https://github.com/grafana/tempo/pull/2660/files#

@mapno mentioned this pull request on Jul 17, 2023
@joe-elliott (Member) left a comment

Partial review. About 1/3 of the way through by file count.

Resolved review threads: cmd/tempo-serverless/handler.go, pkg/api/http.go, cmd/tempo/app/modules.go, modules/ingester/ingester_test.go, tempodb/backend/block_meta.go
@stoewer (Collaborator, Author) commented Jul 19, 2023

I was thinking about implementing json.Marshaler and json.Unmarshaler for backend.DedicatedColumn and changing the marshaling behavior so that type defaults to "string" and scope defaults to "span".

The following dedicated columns:

backend.DedicatedColumns{
	{Scope: "span", Name: "net.host.name", Type: "string"},
	{Scope: "span", Name: "net.host.addr", Type: "string"},
	{Scope: "resource", Name: "dc.region", Type: "string"},
}

Would be marshaled to the following JSON and vice versa:

[{"name": "net.host.name"}, {"name": "net.host.addr"}, {"scope":"resource","dc.region"}]

This would also reduce the size of meta.json and index.json.gz. What do you think? It would also be possible to do this as an optimization in a different PR.
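
For illustration, here is a minimal sketch of what such a marshaler could look like; the type below is a simplified stand-in for backend.DedicatedColumn, not Tempo's actual code:

package main

import (
	"encoding/json"
	"fmt"
)

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn,
// used only to illustrate the proposed defaulting behavior.
type DedicatedColumn struct {
	Scope string `json:"scope,omitempty"`
	Name  string `json:"name"`
	Type  string `json:"type,omitempty"`
}

// MarshalJSON drops scope and type when they hold the proposed defaults
// ("span" and "string"), so the common case serializes to just {"name": ...}.
func (c DedicatedColumn) MarshalJSON() ([]byte, error) {
	cp := c
	if cp.Scope == "span" {
		cp.Scope = ""
	}
	if cp.Type == "string" {
		cp.Type = ""
	}
	type plain DedicatedColumn // avoid recursing into MarshalJSON
	return json.Marshal(plain(cp))
}

// UnmarshalJSON restores the defaults for fields that were omitted.
func (c *DedicatedColumn) UnmarshalJSON(b []byte) error {
	type plain DedicatedColumn
	var p plain
	if err := json.Unmarshal(b, &p); err != nil {
		return err
	}
	*c = DedicatedColumn(p)
	if c.Scope == "" {
		c.Scope = "span"
	}
	if c.Type == "" {
		c.Type = "string"
	}
	return nil
}

func main() {
	cols := []DedicatedColumn{
		{Scope: "span", Name: "net.host.name", Type: "string"},
		{Scope: "span", Name: "net.host.addr", Type: "string"},
		{Scope: "resource", Name: "dc.region", Type: "string"},
	}
	out, _ := json.Marshal(cols)
	fmt.Println(string(out))
	// [{"name":"net.host.name"},{"name":"net.host.addr"},{"scope":"resource","name":"dc.region"}]
}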

@joe-elliott (Member) commented

I was thinking about implementing json.Marshaler and json.Unmarshaler for backend.DedicatedColumn and changing the marshaling behavior so that type defaults to "string" and scope defaults to "span".

Agreed that a custom JSON marshaler to reduce size is wise, especially since this is being encoded in a URL. We could even go a step further and reduce the field names to single chars like "n" instead of "name". I'm fine with doing this in a later PR after we've done some benchmarking at scale.

@joe-elliott (Member) left a comment

Overall excellent work. A very well thought out addition. Some comments and questions I'd like to work through before merging. Nothing major.

One thing I would like to see before merging is the addition of dedicated columns to our search integration tests in /tempodb/tempodb_search_test.go. You should be able to just add some relevant columns to the config here. Let's add both a column that's in the test data and one that's not.

https://github.com/grafana/tempo/blob/main/tempodb/tempodb_search_test.go#L678
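
As a rough illustration of that suggestion, such a config entry could look like the snippet below; the attribute names and the exact field the columns are wired into are assumptions, not taken from tempodb_search_test.go:

package tempodb

import "github.com/grafana/tempo/tempodb/backend"

// testDedicatedColumns is a hypothetical fixture: one column for an attribute
// assumed to be present in the test traces and one for an attribute that is not,
// so searches exercise both populated and empty spare columns.
var testDedicatedColumns = backend.DedicatedColumns{
	{Scope: "span", Name: "http.method", Type: "string"},         // assumed present in test data
	{Scope: "resource", Name: "unused.attribute", Type: "string"}, // never present in test data
}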

Resolved review threads: tempodb/encoding/common/config.go, tempodb/encoding/vparquet3/create.go, tempodb/encoding/vparquet3/block_search.go

// dedicatedColumnsToColumnMapping returns a mapping from attribute names to spare columns for a given
// block meta and scope.
func dedicatedColumnsToColumnMapping(dedicatedColumns backend.DedicatedColumns, scopes ...backend.DedicatedColumnScope) dedicatedColumnMapping {
Member:

The order of dedicated columns is extremely important here. Is ordering guaranteed in the various formats we are serializing to/from? JSON/YAML/proto?

Member:

Having been through the entire PR at this point, I think this is my biggest concern. We pass the dedicated columns, and various representations of them, through all manner of in-memory and serialized representations. If the order ever changes, we are reading the columns incorrectly.

I think some integration-level testing would help with these fears.

Collaborator (Author):

AFAIK arrays are ordered per standard definition in JSON, YAML, and protocol buffers. Therefore, everything should work as long as the de-/serialization is compliant with the respective standard.

It's still a good idea to add an integration test 👍

Member:

You can probably just add dedicated columns configuration to our existing tests under /integration/e2e and that would be sufficient.

Collaborator (Author):

I created a new test integration/e2e/encoding_test.go and added dedicated attribute columns to TestSearchCompleteBlock.
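
To make the ordering concern above concrete, here is a hedged sketch, not Tempo's actual implementation, of a mapping that assigns attributes to spare columns purely by their position in the configured list; if a writer and a reader ever see that list in different orders, attribute names end up pointing at different physical columns:

package main

import "fmt"

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn.
type DedicatedColumn struct {
	Scope, Name, Type string
}

// columnMapping maps attribute names to spare-column indexes. The index is
// simply the position of the entry within its scope, which is why the order
// of the configured columns must be identical on the write and read paths.
func columnMapping(cols []DedicatedColumn, scope string) map[string]int {
	m := make(map[string]int)
	spare := 0
	for _, c := range cols {
		if c.Scope != scope {
			continue
		}
		m[c.Name] = spare
		spare++
	}
	return m
}

func main() {
	cols := []DedicatedColumn{
		{Scope: "span", Name: "net.host.name", Type: "string"},
		{Scope: "span", Name: "net.host.addr", Type: "string"},
		{Scope: "resource", Name: "dc.region", Type: "string"},
	}
	fmt.Println(columnMapping(cols, "span")) // map[net.host.addr:1 net.host.name:0]
}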

Resolved review threads: tempodb/encoding/vparquet3/block_search_tags.go, tempodb/encoding/vparquet3/dedicated_columns.go, tempodb/encoding/vparquet3/schema.go
}

func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
Member:

What is the behavior if a user configures an existing well-known column such as service.name at the resource level? Should we fail validation in this case? Warn the user?

@stoewer (Collaborator, Author) commented Jul 21, 2023

For reading and writing, the well-known attributes (e.g. service.name or http.url) are always handled first. Configuring a dedicated column for a well-known attribute has no effect, except that a spare column is wasted and remains empty.

It would be good to validate this. However, to do it properly we would need to make the well-known attributes available in tempodb/backend. To do that, we would have to either import something from encoding/vparquet3 directly or extend VersionedEncoding. Given the small impact of well-known attributes in the dedicated column configuration, I think it's OK not to validate this.
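
For context, here is a hedged sketch of the kind of checks such a Validate method might perform; the allowed scope and type values are assumptions based on the examples in this thread, and, per the comment above, well-known attributes are deliberately not rejected:

package main

import "fmt"

// DedicatedColumn is a simplified stand-in for the type added in this PR.
type DedicatedColumn struct {
	Scope, Name, Type string
}

// Validate checks that the name is set and that scope and type hold known
// values. It intentionally does not reject well-known attributes such as
// service.name: those are handled before dedicated columns, so configuring
// one merely leaves a spare column empty.
func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
		return fmt.Errorf("dedicated column name must not be empty")
	}
	switch dc.Scope {
	case "span", "resource":
	default:
		return fmt.Errorf("unknown dedicated column scope %q", dc.Scope)
	}
	switch dc.Type {
	case "string":
	default:
		return fmt.Errorf("unsupported dedicated column type %q", dc.Type)
	}
	return nil
}

func main() {
	col := DedicatedColumn{Scope: "span", Name: "net.host.name", Type: "string"}
	fmt.Println(col.Validate()) // <nil>
}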

	urlParamVersion          = "version"
	urlParamSize             = "size"
	urlParamFooterSize       = "footerSize"
	urlParamDedicatedColumns = "dc"
Contributor:

I understand why this is needed, but I think exposing the parquet config here is indicative of a leaky abstraction. Not blocking, but tempted to just send the BlockMeta instead of all the individual params.

Collaborator (Author):

I agree. It would be nice to refactor the way the search block request is built. But I think it's better to do this outside of a larger feature PR.
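
For illustration only (the actual request-building code is not shown in this thread), attaching the dedicated columns to a search block request as the "dc" query parameter might look roughly like this; the parameter values are placeholders:

package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn.
type DedicatedColumn struct {
	Scope string `json:"scope,omitempty"`
	Name  string `json:"name"`
	Type  string `json:"type,omitempty"`
}

func main() {
	cols := []DedicatedColumn{
		{Name: "net.host.name"},
		{Scope: "resource", Name: "dc.region"},
	}

	// Serialize the dedicated columns to compact JSON and attach them as the
	// "dc" parameter next to the other block parameters.
	b, err := json.Marshal(cols)
	if err != nil {
		panic(err)
	}

	q := url.Values{}
	q.Set("version", "vParquet3")
	q.Set("footerSize", "16384") // placeholder value
	q.Set("dc", string(b))

	fmt.Println(q.Encode()) // percent-encoded query string for the block search request
}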

@stoewer (Collaborator, Author) commented Jul 20, 2023

We could even go a step further and reduce the field names to single chars like "n" instead of "name"

If we go for the single-char field names, it would be better to make the change in this PR; otherwise we could run into compatibility problems between meta.json files with single-character and multi-character field names.
I therefore changed the block meta marshaling in my last commit.
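
The shortened keys could be as simple as single-character struct tags; the keys actually chosen in the commit are not shown in this thread, so "s", "n", and "t" below are placeholders:

package backendmeta // hypothetical package, for illustration only

// DedicatedColumn with single-character JSON keys to keep meta.json small.
// The defaults would still be applied by the custom (un)marshaler sketched earlier.
type DedicatedColumn struct {
	Scope string `json:"s,omitempty"` // omitted when it equals the default "span"
	Name  string `json:"n"`
	Type  string `json:"t,omitempty"` // omitted when it equals the default "string"
}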

@stoewer force-pushed the vparquet3-release branch 4 times, most recently from 6ca3945 to f690a2c, on July 21, 2023 08:33
mapno added a commit to mapno/tempo that referenced this pull request Jul 21, 2023
@stoewer force-pushed the vparquet3-release branch 2 times, most recently from b29b0cb to 7d76cc6, on July 26, 2023 02:19
stoewer and others added 12 commits July 26, 2023 12:24

* Re-order schema to keep columns affected by column index changes low

* Add spare columns for dedicated attributes to schema struct

* Add dedicated column config to block meta

* Read and write attributes in dedicated columns

* Make order of dedicated attributes predictable when reading

* Fix existing tests and benchmark

* Run existing benchmarks and tests with dedicated columns

Co-authored-by: Mario <mariorvinas@gmail.com>
* Add dedicated columns to overrides and blocks

* Improvements

* Change test

* Fix tests

* Extend ingester_test:

* Add dedicated columns config to storage block

* Review comments

* Add comment
* Refactor and rename function blockMetaToDedicatedColumnMapping

* Query dedicated attribute columns with TraceQL

* Search tag values in dedicated attribute columns

* Search tags in dedicated attribute columns

* Search for values in dedicated attribute columns in tests

* More consistent naming

* Update block and meta.json in vparquet2/test-data

* Test dedicated column in traceToParquet test

* Format Go code

* Introduce types for dedicated column type and scope

Replace StaticTypeFromString() with DedicatedColumnType.ToStaticType()

* The function dedicatedColumnsToColumnMapping() can receive multiple scopes
* Re-order schema to keep columns affected by column index changes low

* Add spare columns for dedicated attributes to schema struct

* Add dedicated column config to block meta

* Read and write attributes in dedicated columns

* Make order of dedicated attributes predictable when reading

* Fix existing tests and benchmark

* Run existing benchmarks and tests with dedicated columns

* Add dedicated columns to overrides and blocks

* Support dedicated columns in compactor block selection

* Changes to hash

* More tests

---------

Co-authored-by: A. Stoewer <adrian@stoewer.me>
* Add dedicated columns to SearchBlockRequest message

* Assign SearchBlockRequest dedicated cols from BlockMeta and vice versa

* Encode SearchBlockRequest to http request and vice versa

* Don't add empty dedicated columns when building a search request

* Unit tests with dedicated columns

* Implement dedicated column scope and type as protobuf enums
* Add validate function

* Refactor: use DedicatedColumns type instead of []DedicatedColumn

* Initialize logger before verifying the config

This fixes the config verification output

* Check for invalid dedicated columns with '-config.verify true'

* Use ToTempopb() to validate dedicated column scope and type
* Remove TODO comment about caching the dedicated column hash

* Shorten url param for dedicated columns to 'dc'

* Add function to get latest encoding and use it in tests

* Fix name DedicateColumnsFromTempopb
* Remove 'Test' columns from vParquet3 schema

* Rename async iterator environment variable

* Do not export methods of dedicatedColumnMapping

* Skip dedicated attribute lookup depending on scope in searchTagValues

* Validate maximum number of configured dedicated columns

* Test data for vparquet3 uses dedicated columns

* Reduce size of block meta JSON

* Use 'parquet_' prefix for dedicated column configuration
* Add e2e tests for encodings and dedicated attribute columns

* Use dedicated attribute columns in TestSearchCompleteBlock
@joe-elliott (Member) left a comment

here we go!

Resolved review thread: integration/e2e/encodings_test.go
@mapno enabled auto-merge (squash) on July 26, 2023 17:23
@mapno merged commit bcc7924 into main on Jul 26, 2023
15 checks passed
@mapno deleted the vparquet3-release branch on July 26, 2023 17:30
mapno added a commit that referenced this pull request Jul 26, 2023
* vParquet3 docs

* Apply suggestions from code review

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* Address comments

* Apply suggestions from code review

Co-authored-by: A. Stoewer <adrian@stoewer.me>

* Apply suggestions from code review

Co-authored-by: A. Stoewer <adrian@stoewer.me>

* Add tempo-cli example

* Update config param name

Ref: #2649 (comment)

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Co-authored-by: A. Stoewer <adrian@stoewer.me>
4 participants