Investigate _series_id model #24815

Closed
6 of 7 tasks
Tracked by #24650
pauldix opened this issue Mar 25, 2024 · 3 comments

Comments

@pauldix
Member

pauldix commented Mar 25, 2024

I'm calling this the _series_id model of mapping LP to the 3.0 table model. The idea goes like this:

  • Every row has a _series_id, which is a 32-byte SHA-256 hash of the tagset (tags appear in lexicographical order)
  • Data is stored into Parquet files, deduped and sorted by _series_id
  • Tags are columns just like fields. We keep the distinction in the metadata for compatibility with InfluxQL. They need not be dictionary encoded
  • String columns (which include tags) will only use dictionary encoding when being persisted if their cardinality is < N (need to figure out what a good value of N is)
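To make the first two bullets concrete, here is a minimal Python sketch of how a _series_id could be derived and used as the sort/dedupe key. It is not the code from the prototype branches. The length-prefixed serialization of the tagset, the row layout (`tags`, `time`, `fields`), the use of `time` as the second half of the dedupe key, and the function names `series_id` and `sort_and_dedupe` are all assumptions made for illustration; the issue only specifies a SHA-256 hash over tags in lexicographical order.

```python
import hashlib


def series_id(tags):
    """Return a 32-byte SHA-256 digest of the tagset, tags taken in lexicographical key order."""
    h = hashlib.sha256()
    for key in sorted(tags):
        # Length-prefix every key and value so that {"ab": "c"} and {"a": "bc"}
        # cannot collide. This framing is an assumption of the sketch.
        for part in (key, tags[key]):
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
    return h.digest()


def sort_and_dedupe(rows):
    """Sort rows by (_series_id, time), keeping the last-written row for each key."""
    latest = {}
    for row in rows:
        key = (series_id(row["tags"]), row["time"])
        latest[key] = row  # a later write replaces an earlier one with the same key
    return [latest[key] for key in sorted(latest)]
```

Because the key is a fixed-width hash rather than the tag columns themselves, the sort order does not change when a new tag column shows up on other series.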

My thinking is that this will make sorting easier and it'll keep it consistent across the table for all time, even when new columns are introduced. This will also give an entry point later to create indexes on other columns for value -> list of series ids. This could help with single or low cardinality series lookups.
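As a rough illustration of the "value -> list of series ids" index, the sketch below builds one for a single column. The flat row layout with a `_series_id` key and the name `build_value_index` are hypothetical; nothing like this is specified in the issue.

```python
from collections import defaultdict


def build_value_index(rows, column):
    """Map each value seen in `column` to the sorted list of distinct series ids containing it."""
    index = defaultdict(set)
    for row in rows:
        value = row.get(column)
        if value is not None:  # rows without this column are not indexed
            index[value].add(row["_series_id"])
    return {value: sorted(ids) for value, ids in index.items()}
```

A lookup such as `index["a"]` then yields the handful of series ids to seek to in data that is already sorted by _series_id.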

This will also remove the mandatory dictionary encoding for all tag columns, which can be problematic with high cardinality tags.
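The cardinality rule for string columns could look something like the following sketch. The function name `should_dictionary_encode` and the early-exit strategy are assumptions, and the threshold N is left as a parameter because a good value has not been decided.

```python
def should_dictionary_encode(values, max_cardinality):
    """Return True only if the column has fewer than `max_cardinality` distinct values."""
    distinct = set()
    for value in values:
        distinct.add(value)
        if len(distinct) >= max_cardinality:
            return False  # stop early: a high-cardinality column is not worth scanning fully
    return len(distinct) < max_cardinality
```

The early exit keeps the check cheap for exactly the high-cardinality tag columns that make mandatory dictionary encoding a problem.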

Lastly, this is something that will be used in the 3.0 data model that I'm envisioning. So this will create a bridge between the 1.x/2.x data model and 3.x.

Here's a list of tasks I think we'll need to investigate this model:

Tasks

  1. epic/perf-prototyping v3
    hiltontj
  2. epic/perf-prototyping v3
    hiltontj
  3. 4 of 4
    epic/perf-prototyping v3
    hiltontj
  4. epic/perf-prototyping v3
    hiltontj
  5. epic/perf-prototyping v3
@pauldix pauldix added the v3 label Mar 25, 2024
This was referenced Mar 25, 2024
@hiltontj hiltontj changed the title Investigate _series_id model Investigate _series_id model Mar 25, 2024
@hiltontj hiltontj self-assigned this Apr 11, 2024
@hiltontj
Contributor

hiltontj commented Apr 23, 2024

A fair bit of work has been done on the open sub-issues listed above. Much of the work is sitting on issues and branches/PRs that have not been merged, so I wanted to summarize that here.

#24845 - Support column type for storing bytes

#24919 - Benchmark sort/dedupe by tags vs. _series_id

@pauldix
Member Author

pauldix commented Apr 23, 2024

Thanks for writing up the summary, that's very helpful. I'd like to sit on this for a few days. It's not clear that there's more to do here or that we should move forward with it, but I need to think on it for a bit.

@pauldix
Member Author

pauldix commented May 8, 2024

I'm closing this out for now. I think that what the testing here showed overall was that there was some storage impact, which would be greatly mitigated by reshuffling the data so that a single series is clustered together in fewer Parquet files. Compaction improved, using less CPU and RAM, in cases where there were more than 2 tags.

Originally, I thought we'd have to use this to build an index to do quick single series lookups, but I've found a better pathway to get to that. I also speculated that the gains for compaction would be greater than these tests have shown.

Another concern I had, being able to sort data consistently over time, which the _series_id was meant to address, likely won't be a problem in the v3 data model.

Basically, the gains didn't seem impactful enough to continue down this path. We can revisit in the future if it seems like something that will be useful.
