Skip to content
This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

Closed
SynapsePOC opened this issue Sep 5, 2020 · 2 comments
Labels
duplicate This issue or pull request already exists untriaged This is the default tag for a newly created issue

Comments

@SynapsePOC
Copy link

SynapsePOC commented Sep 5, 2020

Describe the issue

Our dataset is in billions of rows. Partitioning is essential.
To retrieve physical rows we use forEachPartition, and fetch rows via REST call to row receptacle during row scan, without returning row to driver, using node / executor to scale the architecture.
We are looking for a way to ensure filtering of partitions to minimize number of partitions accessed by forEachPartition call. To verify index filtering we check query plans.
At the outset index seemed like a direct path to get this access path realized. However, it is currently supporting only bucketing of data. We have come to explore creating an view on top of covered index ( I believe this is what you mention in documentation as "indexed view" ). However, partitioning entails a set of experiments to wring out best optimized access path.

Any suggestions would be greatly appreciated.

As part of approach we have discovered relatively recent pull request implemented inclusion of partitioning key from data file into index. Upon closer look it would appear that index could benefit if partitioning column wasn't merely included into the index structure, but was used to mirror partitioning schema of the index itself, particularly for covered index.

Even in video presentation if you look closely to explain plan of an query provided for example of index improvement, the plan

image

appears with PartitionFilters element empty. When row count is very high, the ability to segregate processing amongst nodes via partitioning is essential.

In an ideal world it would look like the following.

An partitioned data set is saved in parquet, fetched to dataframe variable and preserved with createOrReplaceTempView.
As a next step an index is created, for example, for an "Employee" table partitioning could be by region where employee works with next level of index being city and than date of hire. With rest of fields included we get nice covered index.
When this index is created it should be distributed along the partitioning sections ( could be provision for additional properties in index description ). And a query which include region in the WHERE clause with rest of indexed fields to the right should be able to engage partition filtering optimization mechanism. This should isolate just the nodes aligned with required partitions to perform physical row retrieval ( where optimizer plan is concerned ). An subsequent call .forEachPartition should only access partitions matching to that of region ( partition key ) provided in the query WHERE clause, with query optimizer sending reminder of work to index buckets for efficient row - wise retrieval.
Perhaps there is an access path algorithm I can use now, to attain matching level of efficiency.
Any suggestion greatly appreciated.

@SynapsePOC SynapsePOC added the untriaged This is the default tag for a newly created issue label Sep 5, 2020
@SynapsePOC
Copy link
Author

I may have run over a step with my note.
I have re - red roadmap document ( just to re - check and realign with project pace ) and it looks like what I have requested is a step and a half ahead :)
image
If this is the case, please close or postpone as appropriate.
Thank you!

@apoorvedave1
Copy link
Contributor

Thanks @SynapsePOC , You are correct here, we are working on exploring ways to partition indexes to mirror the data partitioning. Currently the index is not partitioned with hive-partitioning schemes and there are no apis to support it. We have opened #108 for investigating this issue. Thanks again.

@apoorvedave1 apoorvedave1 added the duplicate This issue or pull request already exists label Sep 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
duplicate This issue or pull request already exists untriaged This is the default tag for a newly created issue
Projects
None yet
Development

No branches or pull requests

2 participants