
[vParquet3] new block encoding with support for dedicated columns #2649

Merged
merged 13 commits into main from vparquet3-release on Jul 26, 2023

Conversation

@stoewer (Collaborator) commented Jul 13, 2023

What this PR does:
This adds a new encoding vParquet3 to Tempo (the default encoding remains unchanged). The new encoding supports dedicated attribute columns as outlined by the design proposal.

Two aspects are missing from this PR:

  • A CLI to analyze blocks in order to find good candidates for dedicated attributes
  • Documentation about how to enable vParquet3 and configure dedicated attribute columns

Those will be addressed in separate PRs.

Which issue(s) this PR fixes:
Contributes to #2527

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@stoewer force-pushed the vparquet3-release branch 6 times, most recently from 9b685a5 to 2c49268, on July 14, 2023 04:35
@stoewer changed the title from "[vParquet3] release new encoding with support for dedicated columns" to "[vParquet3] new block encoding with support for dedicated columns" on Jul 14, 2023
@stoewer marked this pull request as ready for review on July 14, 2023 06:10
@joe-elliott (Member) commented

Will give this a proper review in time. Heads up on additional parquet changes here: https://github.com/grafana/tempo/pull/2660/files#

@mapno mentioned this pull request on Jul 17, 2023
@joe-elliott (Member) left a comment

Partial review. About 1/3 of the way through by file count.

Resolved review threads: cmd/tempo-serverless/handler.go, pkg/api/http.go, cmd/tempo/app/modules.go, modules/ingester/ingester_test.go, tempodb/backend/block_meta.go
@stoewer (Collaborator, Author) commented Jul 19, 2023

I was thinking about implementing json.Marshaler and json.Unmarshaler for backend.DedicatedColumn and changing the marshaling behavior so that type defaults to "string" and scope defaults to "span".

The following dedicated columns:

backend.DedicatedColumns{
	{Scope: "span", Name: "net.host.name", Type: "string"},
	{Scope: "span", Name: "net.host.addr", Type: "string"},
	{Scope: "resource", Name: "dc.region", Type: "string"},
}

Would be marshaled to the following JSON and vice versa:

[{"name": "net.host.name"}, {"name": "net.host.addr"}, {"scope":"resource","dc.region"}]

This would also reduce the size of meta.json and index.json.gz. What do you think? It would also be possible to do this as an optimization in a different PR.
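
For illustration, here is a minimal sketch of what such a marshaler could look like; the type below is a simplified stand-in for backend.DedicatedColumn, not Tempo's actual code:

package main

import (
	"encoding/json"
	"fmt"
)

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn,
// used only to illustrate the proposed defaulting behavior.
type DedicatedColumn struct {
	Scope string `json:"scope,omitempty"`
	Name  string `json:"name"`
	Type  string `json:"type,omitempty"`
}

// MarshalJSON drops scope and type when they hold the proposed defaults
// ("span" and "string"), so the common case serializes to just {"name": ...}.
func (c DedicatedColumn) MarshalJSON() ([]byte, error) {
	cp := c
	if cp.Scope == "span" {
		cp.Scope = ""
	}
	if cp.Type == "string" {
		cp.Type = ""
	}
	type plain DedicatedColumn // avoid recursing into MarshalJSON
	return json.Marshal(plain(cp))
}

// UnmarshalJSON restores the defaults for fields that were omitted.
func (c *DedicatedColumn) UnmarshalJSON(b []byte) error {
	type plain DedicatedColumn
	var p plain
	if err := json.Unmarshal(b, &p); err != nil {
		return err
	}
	*c = DedicatedColumn(p)
	if c.Scope == "" {
		c.Scope = "span"
	}
	if c.Type == "" {
		c.Type = "string"
	}
	return nil
}

func main() {
	cols := []DedicatedColumn{
		{Scope: "span", Name: "net.host.name", Type: "string"},
		{Scope: "span", Name: "net.host.addr", Type: "string"},
		{Scope: "resource", Name: "dc.region", Type: "string"},
	}
	out, _ := json.Marshal(cols)
	fmt.Println(string(out))
	// [{"name":"net.host.name"},{"name":"net.host.addr"},{"scope":"resource","name":"dc.region"}]
}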

@joe-elliott (Member) commented

I was thinking about implementing json.Marshaler and json.Unmarshaler for backend.DedicatedColumn and changing the marshaling behavior so that type defaults to "string" and scope defaults to "span".

Agreed that a custom JSON marshaler to reduce size is wise, especially since this is being encoded in a URL. We could even go a step further and reduce the field names to single chars like "n" instead of "name". I'm fine with doing this in a later PR after we've done some benchmarking at scale.

@joe-elliott (Member) left a comment

Overall excellent work. A very well thought out addition. Some comments and questions I'd like to work through before merging. Nothing major.

One thing I would like to see before merging is the addition of dedicated columns to our search integration tests in /tempodb/tempodb_search_test.go. You should be able to just add some relevant columns to the config here. Let's add both a column that's in the test data and one that's not.

https://github.com/grafana/tempo/blob/main/tempodb/tempodb_search_test.go#L678
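
As a rough illustration of that suggestion, such a config entry could look like the snippet below; the attribute names and the exact field the columns are wired into are assumptions, not taken from tempodb_search_test.go:

package tempodb

import "github.com/grafana/tempo/tempodb/backend"

// testDedicatedColumns is a hypothetical fixture: one column for an attribute
// assumed to be present in the test traces and one for an attribute that is not,
// so searches exercise both populated and empty spare columns.
var testDedicatedColumns = backend.DedicatedColumns{
	{Scope: "span", Name: "http.method", Type: "string"},         // assumed present in test data
	{Scope: "resource", Name: "unused.attribute", Type: "string"}, // never present in test data
}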

Resolved review threads: tempodb/encoding/common/config.go, tempodb/encoding/vparquet3/create.go, tempodb/encoding/vparquet3/block_search.go

// dedicatedColumnsToColumnMapping returns a mapping from attribute names to spare columns for a given
// block meta and scope.
func dedicatedColumnsToColumnMapping(dedicatedColumns backend.DedicatedColumns, scopes ...backend.DedicatedColumnScope) dedicatedColumnMapping {
Member:

The order of dedicated columns is extremely important here. Is ordering guaranteed in the various formats we are serializing to/from? JSON/YAML/proto?

Member:

Having been through the entire PR at this point, I think this is my biggest concern. We pass the dedicated columns, and various representations of them, through all manner of in-memory and serialized representations. If the order ever changes, we are reading the columns incorrectly.

I think some integration-level testing would help with these fears.

Collaborator (Author):

AFAIK arrays are ordered per standard definition in JSON, YAML, and protocol buffers. Therefore, everything should work as long as the de-/serialization is compliant with the respective standard.

It's still a good idea to add an integration test 👍

Member:

You can probably just add dedicated columns configuration to our existing tests under /integration/e2e and that would be sufficient.

Collaborator (Author):

I created a new test integration/e2e/encoding_test.go and added dedicated attribute columns to TestSearchCompleteBlock.
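
To make the ordering concern above concrete, here is a hedged sketch, not Tempo's actual implementation, of a mapping that assigns attributes to spare columns purely by their position in the configured list; if a writer and a reader ever see that list in different orders, attribute names end up pointing at different physical columns:

package main

import "fmt"

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn.
type DedicatedColumn struct {
	Scope, Name, Type string
}

// columnMapping maps attribute names to spare-column indexes. The index is
// simply the position of the entry within its scope, which is why the order
// of the configured columns must be identical on the write and read paths.
func columnMapping(cols []DedicatedColumn, scope string) map[string]int {
	m := make(map[string]int)
	spare := 0
	for _, c := range cols {
		if c.Scope != scope {
			continue
		}
		m[c.Name] = spare
		spare++
	}
	return m
}

func main() {
	cols := []DedicatedColumn{
		{Scope: "span", Name: "net.host.name", Type: "string"},
		{Scope: "span", Name: "net.host.addr", Type: "string"},
		{Scope: "resource", Name: "dc.region", Type: "string"},
	}
	fmt.Println(columnMapping(cols, "span")) // map[net.host.addr:1 net.host.name:0]
}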

Resolved review threads: tempodb/encoding/vparquet3/block_search_tags.go, tempodb/encoding/vparquet3/dedicated_columns.go, tempodb/encoding/vparquet3/schema.go
}

func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
Member:

What is the behavior if a user configures an existing well-known column such as service.name at the resource level? Should we fail validation in this case? Warn the user?

@stoewer (Collaborator, Author) commented Jul 21, 2023

For reading and writing, the well-known attributes (e.g. service.name or http.url) are always handled first. Configuring a dedicated column for a well-known attribute has no effect, except that a spare column is wasted and remains empty.

It would be good to validate this. However, to do it properly we would need to make the well-known attributes available in tempodb/backend. To do that, we would have to either import something from encoding/vparquet3 directly or extend VersionedEncoding. Given the small impact of well-known attributes in the dedicated column configuration, I think it's OK not to validate this.
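
For context, here is a hedged sketch of the kind of checks such a Validate method might perform; the allowed scope and type values are assumptions based on the examples in this thread, and, per the comment above, well-known attributes are deliberately not rejected:

package main

import "fmt"

// DedicatedColumn is a simplified stand-in for the type added in this PR.
type DedicatedColumn struct {
	Scope, Name, Type string
}

// Validate checks that the name is set and that scope and type hold known
// values. It intentionally does not reject well-known attributes such as
// service.name: those are handled before dedicated columns, so configuring
// one merely leaves a spare column empty.
func (dc *DedicatedColumn) Validate() error {
	if dc.Name == "" {
		return fmt.Errorf("dedicated column name must not be empty")
	}
	switch dc.Scope {
	case "span", "resource":
	default:
		return fmt.Errorf("unknown dedicated column scope %q", dc.Scope)
	}
	switch dc.Type {
	case "string":
	default:
		return fmt.Errorf("unsupported dedicated column type %q", dc.Type)
	}
	return nil
}

func main() {
	col := DedicatedColumn{Scope: "span", Name: "net.host.name", Type: "string"}
	fmt.Println(col.Validate()) // <nil>
}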

	urlParamVersion          = "version"
	urlParamSize             = "size"
	urlParamFooterSize       = "footerSize"
	urlParamDedicatedColumns = "dc"
Contributor:

I understand why this is needed, but I think exposing the parquet config here is indicative of a leaky abstraction. Not blocking, but tempted to just send the BlockMeta instead of all the individual params.

Collaborator (Author):

I agree. It would be nice to refactor the way the search block request is built. But I think it's better to do this outside of a larger feature PR.
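
For illustration only (the actual request-building code is not shown in this thread), attaching the dedicated columns to a search block request as the "dc" query parameter might look roughly like this; the parameter values are placeholders:

package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// DedicatedColumn is a simplified stand-in for backend.DedicatedColumn.
type DedicatedColumn struct {
	Scope string `json:"scope,omitempty"`
	Name  string `json:"name"`
	Type  string `json:"type,omitempty"`
}

func main() {
	cols := []DedicatedColumn{
		{Name: "net.host.name"},
		{Scope: "resource", Name: "dc.region"},
	}

	// Serialize the dedicated columns to compact JSON and attach them as the
	// "dc" parameter next to the other block parameters.
	b, err := json.Marshal(cols)
	if err != nil {
		panic(err)
	}

	q := url.Values{}
	q.Set("version", "vParquet3")
	q.Set("footerSize", "16384") // placeholder value
	q.Set("dc", string(b))

	fmt.Println(q.Encode()) // percent-encoded query string for the block search request
}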

@stoewer (Collaborator, Author) commented Jul 20, 2023

We could even go a step further and reduce the field names to single chars like "n" instead of "name"

If we go for the single-char field names, it would be better to make the change in this PR; otherwise we could run into compatibility problems between meta.json files with single-character and multi-character field names.
I therefore changed the block meta marshaling in my last commit.
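
The shortened keys could be as simple as single-character struct tags; the keys actually chosen in the commit are not shown in this thread, so "s", "n", and "t" below are placeholders:

package backendmeta // hypothetical package, for illustration only

// DedicatedColumn with single-character JSON keys to keep meta.json small.
// The defaults would still be applied by the custom (un)marshaler sketched earlier.
type DedicatedColumn struct {
	Scope string `json:"s,omitempty"` // omitted when it equals the default "span"
	Name  string `json:"n"`
	Type  string `json:"t,omitempty"` // omitted when it equals the default "string"
}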

@stoewer force-pushed the vparquet3-release branch 4 times, most recently from 6ca3945 to f690a2c, on July 21, 2023 08:33
mapno added a commit to mapno/tempo that referenced this pull request Jul 21, 2023
@stoewer force-pushed the vparquet3-release branch 2 times, most recently from b29b0cb to 7d76cc6, on July 26, 2023 02:19
stoewer and others added 12 commits July 26, 2023 12:24

* Re-order schema to keep columns affected by column index changes low

* Add spare columns for dedicated attributes to schema struct

* Add dedicated column config to block meta

* Read and write attributes in dedicated columns

* Make order of dedicated attributes predictable when reading

* Fix existing tests and benchmark

* Run existing benchmarks and tests with dedicated columns

Co-authored-by: Mario <mariorvinas@gmail.com>
* Add dedicated columns to overrides and blocks

* Improvements

* Change test

* Fix tests

* Extend ingester_test:

* Add dedicated columns config to storage block

* Review comments

* Add comment
* Refactor and rename function blockMetaToDedicatedColumnMapping

* Query dedicated attribute columns with TraceQL

* Search tag values in dedicated attribute columns

* Search tags in dedicated attribute columns

* Search for values in dedicated attribute columns in tests

* More consistent naming

* Update block and meta.json in vparquet2/test-data

* Test dedicated column in traceToParquet test

* Format Go code

* Introduce types for dedicated column type and scope

Replace StaticTypeFromString() with DedicatedColumnType.ToStaticType()

* The function dedicatedColumnsToColumnMapping() can receive multiple scopes
* Re-order schema to keep columns affected by column index changes low

* Add spare columns for dedicated attributes to schema struct

* Add dedicated column config to block meta

* Read and write attributes in dedicated columns

* Make order of dedicated attributes predictable when reading

* Fix existing tests and benchmark

* Run existing benchmarks and tests with dedicated columns

* Add dedicated columns to overrides and blocks

* Support dedicated columns in compactor block selection

* Changes to hash

* More tests

---------

Co-authored-by: A. Stoewer <adrian@stoewer.me>
* Add dedicated columns to SearchBlockRequest message

* Assign SearchBlockRequest dedicated cols from BlockMeta and vice versa

* Encode SearchBlockRequest to http request and vice versa

* Don't add empty dedicated columns when building a search request

* Unit tests with dedicated columns

* Implement dedicated column scope and type as protobuf enums
* Add validate function

* Refactor: use DedicatedColumns type instead of []DedicatedColumn

* Initialize logger before verifying the config

This fixes the config verification output

* Check for invalid dedicated columns with '-config.verify true'

* Use ToTempopb() to validate dedicated column scope and type
* Remove TODO comment about caching the dedicated column hash

* Shorten url param for dedicated columns to 'dc'

* Add function to get latest encoding and use it in tests

* Fix name DedicateColumnsFromTempopb
* Remove 'Test' columns from vParquet3 schema

* Rename async iterator environment variable

* Do not export methods of dedicatedColumnMapping

* Skip dedicated attribute lookup depending on scope in searchTagValues

* Validate maximum number of configured dedicated columns

* Test data for vparquet3 uses dedicated columns

* Reduce size of block meta JSON

* Use 'parquet_' prefix for dedicated column configuration
* Add e2e tests for encodings and dedicated attribute columns

* Use dedicated attribute columns in TestSearchCompleteBlock
@joe-elliott (Member) left a comment

here we go!

Resolved review thread: integration/e2e/encodings_test.go
@mapno enabled auto-merge (squash) on July 26, 2023 17:23
@mapno merged commit bcc7924 into main on Jul 26, 2023
15 checks passed
@mapno deleted the vparquet3-release branch on July 26, 2023 17:30
mapno added a commit that referenced this pull request Jul 26, 2023
* vParquet3 docs

* Apply suggestions from code review

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>

* Address comments

* Apply suggestions from code review

Co-authored-by: A. Stoewer <adrian@stoewer.me>

* Apply suggestions from code review

Co-authored-by: A. Stoewer <adrian@stoewer.me>

* Add tempo-cli example

* Update config param name

Ref: #2649 (comment)

---------

Co-authored-by: Kim Nylander <104772500+knylander-grafana@users.noreply.github.com>
Co-authored-by: A. Stoewer <adrian@stoewer.me>
4 participants