You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.
Our dataset is in billions of rows. Partitioning is essential.
To retrieve physical rows we use forEachPartition, and fetch rows via REST call to row receptacle during row scan, without returning row to driver, using node / executor to scale the architecture.
We are looking for a way to ensure filtering of partitions to minimize number of partitions accessed by forEachPartition call. To verify index filtering we check query plans.
At the outset index seemed like a direct path to get this access path realized. However, it is currently supporting only bucketing of data. We have come to explore creating an view on top of covered index ( I believe this is what you mention in documentation as "indexed view" ). However, partitioning entails a set of experiments to wring out best optimized access path.
Any suggestions would be greatly appreciated.
As part of approach we have discovered relatively recent pull request implemented inclusion of partitioning key from data file into index. Upon closer look it would appear that index could benefit if partitioning column wasn't merely included into the index structure, but was used to mirror partitioning schema of the index itself, particularly for covered index.
Even in video presentation if you look closely to explain plan of an query provided for example of index improvement, the plan
appears with PartitionFilters element empty. When row count is very high, the ability to segregate processing amongst nodes via partitioning is essential.
In an ideal world it would look like the following.
An partitioned data set is saved in parquet, fetched to dataframe variable and preserved with createOrReplaceTempView.
As a next step an index is created, for example, for an "Employee" table partitioning could be by region where employee works with next level of index being city and than date of hire. With rest of fields included we get nice covered index.
When this index is created it should be distributed along the partitioning sections ( could be provision for additional properties in index description ). And a query which include region in the WHERE clause with rest of indexed fields to the right should be able to engage partition filtering optimization mechanism. This should isolate just the nodes aligned with required partitions to perform physical row retrieval ( where optimizer plan is concerned ). An subsequent call .forEachPartition should only access partitions matching to that of region ( partition key ) provided in the query WHERE clause, with query optimizer sending reminder of work to index buckets for efficient row - wise retrieval.
Perhaps there is an access path algorithm I can use now, to attain matching level of efficiency.
Any suggestion greatly appreciated.
The text was updated successfully, but these errors were encountered:
I may have run over a step with my note.
I have re - red roadmap document ( just to re - check and realign with project pace ) and it looks like what I have requested is a step and a half ahead :)
If this is the case, please close or postpone as appropriate.
Thank you!
Thanks @SynapsePOC , You are correct here, we are working on exploring ways to partition indexes to mirror the data partitioning. Currently the index is not partitioned with hive-partitioning schemes and there are no apis to support it. We have opened #108 for investigating this issue. Thanks again.
Describe the issue
Our dataset is in billions of rows. Partitioning is essential.
To retrieve physical rows we use forEachPartition, and fetch rows via REST call to row receptacle during row scan, without returning row to driver, using node / executor to scale the architecture.
We are looking for a way to ensure filtering of partitions to minimize number of partitions accessed by forEachPartition call. To verify index filtering we check query plans.
At the outset index seemed like a direct path to get this access path realized. However, it is currently supporting only bucketing of data. We have come to explore creating an view on top of covered index ( I believe this is what you mention in documentation as "indexed view" ). However, partitioning entails a set of experiments to wring out best optimized access path.
Any suggestions would be greatly appreciated.
As part of approach we have discovered relatively recent pull request implemented inclusion of partitioning key from data file into index. Upon closer look it would appear that index could benefit if partitioning column wasn't merely included into the index structure, but was used to mirror partitioning schema of the index itself, particularly for covered index.
Even in video presentation if you look closely to explain plan of an query provided for example of index improvement, the plan
appears with PartitionFilters element empty. When row count is very high, the ability to segregate processing amongst nodes via partitioning is essential.
In an ideal world it would look like the following.
An partitioned data set is saved in parquet, fetched to dataframe variable and preserved with createOrReplaceTempView.
As a next step an index is created, for example, for an "Employee" table partitioning could be by region where employee works with next level of index being city and than date of hire. With rest of fields included we get nice covered index.
When this index is created it should be distributed along the partitioning sections ( could be provision for additional properties in index description ). And a query which include region in the WHERE clause with rest of indexed fields to the right should be able to engage partition filtering optimization mechanism. This should isolate just the nodes aligned with required partitions to perform physical row retrieval ( where optimizer plan is concerned ). An subsequent call .forEachPartition should only access partitions matching to that of region ( partition key ) provided in the query WHERE clause, with query optimizer sending reminder of work to index buckets for efficient row - wise retrieval.
Perhaps there is an access path algorithm I can use now, to attain matching level of efficiency.
Any suggestion greatly appreciated.
The text was updated successfully, but these errors were encountered: