
Auto-partitioning built-in feature #62

Open
muntdan opened this issue Mar 28, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@muntdan

muntdan commented Mar 28, 2023

Hi,

TimescaleDB has built-in auto-partitioning on one or two dimensions (time and tenant/device ID/etc.).

Your documentation states that partitioning should be used for time-based data, but it can currently only be done manually.
It would be great if you implemented auto-creation of partitions based on a configured interval and/or a secondary criterion. This would help a lot with the GDPR-cleanup use case, where dropping older data would be much faster and easier.

@wuputah
Member

wuputah commented Mar 28, 2023

pg_partman can help with automating table-based partition management, but yes, user-configurable auto-partitioning is something we would like to add to the storage layer as well.

@wuputah added the enhancement (New feature or request) label Mar 28, 2023
@muntdan
Author

muntdan commented May 24, 2023

Based on the fact that Hydra as-is cannot cope with multiple queries in a timely fashion, and that TimescaleDB released version 2.11 with support for update/delete on compressed data, I have started comparing the two of them again from the point of view of reading the data.
My results are as follows, on the same volume of data (234 million rows):
Hydra: 1-month avg single query = 5-5.5 seconds / 7-8 cores (25-32% of 24 cores) / 2-8% IOPS
TimescaleDB: 1-month avg single query = 1-1.5 seconds / 5-10 cores (20-35% of 24 cores) / 2-8% IOPS
with chunk_time_interval & compression set to 1 hour.

The reason I am writing this here on this issue is that I believe another setting in my TimescaleDB setup helps a lot with this performance: space partitioning. In the above setup I have configured 10 tenant slices that divide the customers on each table, so a query for a given client only has to hit 1/10 of the total chunks.

In conclusion, I believe the auto-partitioning feature requested here might also help a lot with the actual query performance from the related issue. If a simple division could be applied (probably not), 1/10 x 5.5 seconds would put the equivalent performance at ~0.55 seconds, comparable to TimescaleDB.
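For context, the TimescaleDB setup described above (10 space partitions on the tenant column, 1-hour chunks) would be configured roughly like this; the table and column names here are hypothetical stand-ins, not the actual schema:

```sql
-- Sketch of a TimescaleDB hypertable with both a time dimension
-- and a space dimension (10 hash slices on tenant_id), assuming
-- TimescaleDB 2.x's create_hypertable signature.
SELECT create_hypertable(
    'measurements', 'recorded_at',
    partitioning_column => 'tenant_id',
    number_partitions   => 10,
    chunk_time_interval => INTERVAL '1 hour'
);
```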

@wuputah
Member

wuputah commented May 24, 2023

In the near term, native Postgres partitioning is the way to do this with Hydra. We don't plan to add this as a primitive to the storage layer for quite some time. We have pg_partman on the hosted product, but it is really only needed if you want "auto time partitioning": it helps create new partitions for every slice of time you define (e.g. week or month).
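To illustrate, a minimal pg_partman setup for monthly time partitioning might look like the following. This is a sketch assuming pg_partman 4.x's create_parent signature; the metrics table and its columns are hypothetical:

```sql
-- The parent table must already be declaratively partitioned
-- by the time column pg_partman will manage.
CREATE TABLE metrics (
    tenant_id   INTEGER,
    recorded_at TIMESTAMPTZ NOT NULL,
    value       NUMERIC
) PARTITION BY RANGE (recorded_at);

-- Register the table with pg_partman, which will pre-create and
-- maintain monthly child partitions from here on.
SELECT partman.create_parent(
    p_parent_table := 'public.metrics',
    p_control      := 'recorded_at',
    p_type         := 'native',
    p_interval     := 'monthly'
);
```

After this, pg_partman's background maintenance (run_maintenance) keeps creating partitions ahead of the current time window.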

For "partition this into X slices by ID" you can use hash partitioning. It's just a matter of writing the SQL, which is a bit repetitive but fortunately only needs to be done once. Here's an example:

CREATE TABLE customers (id INTEGER, status TEXT, arr NUMERIC) PARTITION BY HASH (id);
CREATE TABLE cust_part1 PARTITION OF customers FOR VALUES WITH (MODULUS 3, REMAINDER 0) USING columnar;
CREATE TABLE cust_part2 PARTITION OF customers FOR VALUES WITH (MODULUS 3, REMAINDER 1) USING columnar;
CREATE TABLE cust_part3 PARTITION OF customers FOR VALUES WITH (MODULUS 3, REMAINDER 2) USING columnar;

The thing Timescale does here that's pretty tricky to replicate is partitioning by both time and dimension; in this case, it would be like partitioning by time first, and then by hash on ID. This is possible with native partitioning, but it gets tricky to manage since there are many partitions to create, and I'm not aware of pg_partman helping with this case.

PostgreSQL declarative partitioning does support multilevel partitions, nested to any depth, and any combination of methods can be stacked: LIST-LIST, LIST-RANGE, LIST-HASH, RANGE-RANGE, RANGE-LIST, RANGE-HASH, HASH-HASH, HASH-LIST, and HASH-RANGE.
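As a sketch of the time-then-hash combination discussed above (hypothetical table and column names; only the leaf partitions carry USING columnar, since only leaves actually store rows):

```sql
-- Top level: range-partitioned by time.
CREATE TABLE events (
    tenant_id  INTEGER,
    created_at TIMESTAMPTZ NOT NULL,
    payload    TEXT
) PARTITION BY RANGE (created_at);

-- Second level: each time partition is itself hash-partitioned by ID.
CREATE TABLE events_2023_05 PARTITION OF events
    FOR VALUES FROM ('2023-05-01') TO ('2023-06-01')
    PARTITION BY HASH (tenant_id);

-- Leaf partitions: these hold the data, so they get the columnar AM.
CREATE TABLE events_2023_05_p0 PARTITION OF events_2023_05
    FOR VALUES WITH (MODULUS 3, REMAINDER 0) USING columnar;
CREATE TABLE events_2023_05_p1 PARTITION OF events_2023_05
    FOR VALUES WITH (MODULUS 3, REMAINDER 1) USING columnar;
CREATE TABLE events_2023_05_p2 PARTITION OF events_2023_05
    FOR VALUES WITH (MODULUS 3, REMAINDER 2) USING columnar;
```

The repetition is the pain point: every new time slice needs its own set of hash leaves, which is exactly the management burden mentioned above.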

@muntdan
Author

muntdan commented May 24, 2023

So if I define the 10-tenant manual partitioning on a columnar table and then insert the data via ETL every ~5 minutes into the main table, will the columnar stripes/chunks follow the rules and actually split the batch data into 10 chunks behind it?
Also, will a columnar query on the main table with a WHERE condition on the tenant know which chunks to read?

@wuputah
Member

wuputah commented May 24, 2023

Yes to both.

The only caveat I can think of is that VACUUM will need to be run on the individual partitions if you are trying to maintain optimal striping.
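Using the hash-partitioned customers table from the earlier example, that per-partition maintenance would look like:

```sql
-- Vacuum each leaf partition individually so Hydra's columnar AM can
-- re-pack small stripes left behind by frequent small ETL batches.
VACUUM cust_part1;
VACUUM cust_part2;
VACUUM cust_part3;
```

Running VACUUM on the parent also reaches the leaves in current PostgreSQL versions, but per-partition runs let you target only the partitions that received new data.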
