Investigate `_series_id` model #24815

pauldix · 2024-03-25T01:01:41Z

I'm calling this the _series_id model of mapping LP to the 3.0 table model. The idea goes like this:

- Every row has a _series_id, which is a 32-byte SHA-256 hash of the tagset (tags appear in lexicographical order).
- Data is stored in Parquet files, deduped and sorted by _series_id.
- Tag columns are treated just like fields. We keep the distinction in the metadata for compatibility with InfluxQL. They need not be dictionary encoded.
- String columns (which include tags) will only use dictionary encoding when being persisted if their cardinality is < N (need to figure out what a good value of N is).
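
To make the first point concrete, here is a minimal Python sketch of deriving a 32-byte id from a tagset. The byte layout is an assumption: the issue only says the tags are hashed in lexicographical order, so the length-prefixed key/value serialization and the function name `series_id` are illustrative, not necessarily what the prototype does.

```python
import hashlib
from typing import Mapping


def series_id(tags: Mapping[str, str]) -> bytes:
    """Return a 32-byte SHA-256 digest of a tagset.

    Assumption: each key and value is UTF-8 encoded and prefixed with its
    4-byte big-endian length, so that ("ab", "c") and ("a", "bc") cannot
    collide. The real serialization is not specified in this issue.
    """
    hasher = hashlib.sha256()
    # Sort by tag key so the same tagset always hashes to the same id,
    # regardless of the order the tags arrived in.
    for key in sorted(tags):
        for part in (key, tags[key]):
            data = part.encode("utf-8")
            hasher.update(len(data).to_bytes(4, "big"))
            hasher.update(data)
    return hasher.digest()


a = series_id({"host": "server01", "region": "us-west"})
b = series_id({"region": "us-west", "host": "server01"})
print(len(a), a == b)  # 32 True
```

Because the digest has a fixed width, the id fits a fixed-size binary column no matter how many tags a table has.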

My thinking is that this will make sorting easier and it'll keep it consistent across the table for all time, even when new columns are introduced. This will also give an entry point later to create indexes on other columns for value -> list of series ids. This could help with single or low cardinality series lookups.
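
A rough Python sketch of both ideas in this paragraph, under stated assumptions: rows are plain dicts, duplicates are resolved by keeping the last row written for a given (_series_id, time) pair, and the helper names `sort_and_dedupe` and `build_value_index` are hypothetical. The point is only that the sort key stays the same two columns even if new tag columns show up later.

```python
from collections import defaultdict


def sort_and_dedupe(rows):
    """Sort rows by (_series_id, time), keeping the last row seen per key.

    Assumption: "last write wins" is the dedupe rule.
    """
    latest = {}
    for row in rows:
        # A later row with the same key replaces the earlier one.
        latest[(row["_series_id"], row["time"])] = row
    return [latest[key] for key in sorted(latest)]


def build_value_index(rows, column):
    """Map each value of `column` to a sorted list of series ids."""
    index = defaultdict(set)
    for row in rows:
        if column in row:  # rows written before the column existed are skipped
            index[row[column]].add(row["_series_id"])
    return {value: sorted(ids) for value, ids in index.items()}


# Placeholder 32-byte ids stand in for real hashes.
rows = [
    {"_series_id": b"\x02" * 32, "time": 1, "host": "b", "usage": 1.0},
    {"_series_id": b"\x01" * 32, "time": 2, "host": "a", "usage": 2.0},
    {"_series_id": b"\x01" * 32, "time": 1, "host": "a", "usage": 3.0},
    {"_series_id": b"\x01" * 32, "time": 1, "host": "a", "usage": 4.0},
]
deduped = sort_and_dedupe(rows)
index = build_value_index(deduped, "host")
```

A lookup such as `index["a"]` then yields the series ids to scan for, which is the entry point for the single or low cardinality lookups mentioned above.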

This will also remove the mandatory dictionary encoding for all tag columns, which can be problematic with high cardinality tags.
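
The cardinality cutoff could be checked at persist time along these lines. This is only a sketch: `should_dictionary_encode` and `max_cardinality` are hypothetical names, and a good value of N is still an open question.

```python
def should_dictionary_encode(values, max_cardinality):
    """Return True when the column has fewer than `max_cardinality` distinct values.

    `max_cardinality` plays the role of N; its value here is a placeholder.
    """
    distinct = set()
    for value in values:
        distinct.add(value)
        # Stop early once the threshold is reached; no need to scan the rest.
        if len(distinct) >= max_cardinality:
            return False
    return True


print(should_dictionary_encode(["us-west", "us-east", "us-west"], max_cardinality=3))  # True
```

The early exit keeps the check cheap for high cardinality columns, which are exactly the ones that should skip dictionary encoding.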

Lastly, this is something that will be used in the 3.0 data model that I'm envisioning. So this will create a bridge between the 1.x/2.x data model and 3.x.

Here's a list of tasks I think we'll need to investigate this model:

Tasks

- Add a series_id accessor to ParsedLine #24821 (epic/perf-prototyping, v3)
- Support column type for storing bytes #24845 (epic/perf-prototyping, v3)
- Benchmark sort/deduplicate on some data sets with 1-15 tags and with different cardinalities #24919 (4 of 4; epic/perf-prototyping, v3)
- Sort by _series_id instead of tags #24909 (epic/perf-prototyping, v3)
- Separate query API for experimentation #24822 (epic/perf-prototyping, v3)
- Investigate how to have just series_id as a key when working with iox_query package
- Investigate how to have tags that are not dictionary encoded and not part of primary key (just another column? How to work with InfluxQL?)

hiltontj · 2024-04-23T13:26:12Z

A fair bit of work has been done on the open sub-issues listed above. Much of the work is sitting on issues and branches/PRs that have not been merged, so I wanted to summarize that here.

#24845 - Support column type for storing bytes

- Extended the schema definitions to support a SeriesId column type for storing a FixedSizeBinary, which is treated as a primary key. Changes can be seen in influxdb3_core: "feat: support SeriesId as a column type" (influxdb3_core#12).
- The _series_id is stored in Parquet using DELTA_BYTE_ARRAY encoding, which made a big improvement on compressed file size (see the comment on "Sort by _series_id instead of tags" #24909).
- Measured the impact on Parquet file size of adding the _series_id column (see the comment on "Support column type for storing bytes" #24845). Outcomes:
  - the impact on size varies with cardinality, so compaction would reduce its impact
  - the column takes about as much space as a float field

#24919 - Benchmark sort/dedupe by tags vs. `_series_id`

- _series_id improves compaction time as the number of tags in the data set increases. This trend holds when varying each of the following (each result is in a comment on #24919):
  - thread count
  - cardinality
  - tag length
- _series_id can reduce peak memory usage during compaction when the data set has a larger number of tags, but can increase peak memory usage when there are few tags (see the comment on #24919).

pauldix · 2024-04-23T14:46:14Z

Thanks for writing up the summary, that's very helpful. I'd like to sit on this for a few days. It's not clear that there's more to do here or that we should move forward with it, but I need to think on it for a bit.

pauldix · 2024-05-08T22:20:44Z

I'm closing this out for now. I think that what the testing here showed overall was that there was some storage impact, which would be greatly mitigated by reshuffling the data so that a single series is clustered together in fewer Parquet files. Compaction improved, using less CPU and RAM, in cases where there were more than 2 tags.

Originally, I thought we'd have to use this to build an index to do quick single series lookups, but I've found a better pathway to get to that. I also speculated that the gains for compaction would be greater than these tests have shown.

One of the other concerns I had, being able to consistently sort data over time, which the series_id was meant to address, likely won't be a problem in the v3 data model.

Basically, the gains didn't seem impactful enough to continue down this path. We can revisit in the future if it seems like something that will be useful.

pauldix added the v3 label Mar 25, 2024

pauldix mentioned this issue Mar 25, 2024

Investigate Parquet index #24819

Open

This was referenced Mar 25, 2024

Edge Alpha #24650

Open

Separate query API for experimentation #24822

Closed

pauldix added the epic/perf-prototyping label Mar 25, 2024

hiltontj changed the title ~~Investigate _series_id model~~ Investigate _series_id model Mar 25, 2024

hiltontj mentioned this issue Mar 26, 2024

Support column type for storing bytes #24845

Closed

hiltontj self-assigned this Apr 11, 2024

This was referenced May 7, 2024

chore: remove _series_id #24969

Merged

feat: add a compaction test to the load generator #24925

Closed

pauldix closed this as completed May 8, 2024

This was referenced May 13, 2024

feat: support 32-byte fixed size binary field type influxdata/influxdb3_core#13

Closed

feat: support SeriesId as a column type influxdata/influxdb3_core#12

Closed
