Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

SynapsePOC · 2020-09-05T13:55:21Z

Describe the issue

Our dataset is in billions of rows. Partitioning is essential.
To retrieve physical rows we use forEachPartition, and fetch rows via REST call to row receptacle during row scan, without returning row to driver, using node / executor to scale the architecture.
We are looking for a way to ensure filtering of partitions to minimize number of partitions accessed by forEachPartition call. To verify index filtering we check query plans.
At the outset index seemed like a direct path to get this access path realized. However, it is currently supporting only bucketing of data. We have come to explore creating an view on top of covered index ( I believe this is what you mention in documentation as "indexed view" ). However, partitioning entails a set of experiments to wring out best optimized access path.

Any suggestions would be greatly appreciated.

As part of approach we have discovered relatively recent pull request implemented inclusion of partitioning key from data file into index. Upon closer look it would appear that index could benefit if partitioning column wasn't merely included into the index structure, but was used to mirror partitioning schema of the index itself, particularly for covered index.

Even in video presentation if you look closely to explain plan of an query provided for example of index improvement, the plan

appears with PartitionFilters element empty. When row count is very high, the ability to segregate processing amongst nodes via partitioning is essential.

In an ideal world it would look like the following.

An partitioned data set is saved in parquet, fetched to dataframe variable and preserved with createOrReplaceTempView.
As a next step an index is created, for example, for an "Employee" table partitioning could be by region where employee works with next level of index being city and than date of hire. With rest of fields included we get nice covered index.
When this index is created it should be distributed along the partitioning sections ( could be provision for additional properties in index description ). And a query which include region in the WHERE clause with rest of indexed fields to the right should be able to engage partition filtering optimization mechanism. This should isolate just the nodes aligned with required partitions to perform physical row retrieval ( where optimizer plan is concerned ). An subsequent call .forEachPartition should only access partitions matching to that of region ( partition key ) provided in the query WHERE clause, with query optimizer sending reminder of work to index buckets for efficient row - wise retrieval.
Perhaps there is an access path algorithm I can use now, to attain matching level of efficiency.
Any suggestion greatly appreciated.

SynapsePOC · 2020-09-05T23:25:36Z

I may have run over a step with my note.
I have re - red roadmap document ( just to re - check and realign with project pace ) and it looks like what I have requested is a step and a half ahead :)

If this is the case, please close or postpone as appropriate.
Thank you!

apoorvedave1 · 2020-09-09T01:58:06Z

Thanks @SynapsePOC , You are correct here, we are working on exploring ways to partition indexes to mirror the data partitioning. Currently the index is not partitioned with hive-partitioning schemes and there are no apis to support it. We have opened #108 for investigating this issue. Thanks again.

SynapsePOC added the untriaged This is the default tag for a newly created issue label Sep 5, 2020

apoorvedave1 added the duplicate This issue or pull request already exists label Sep 9, 2020

apoorvedave1 closed this as completed Sep 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

SynapsePOC commented Sep 5, 2020 •

edited

SynapsePOC commented Sep 5, 2020

apoorvedave1 commented Sep 9, 2020

Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

Trying to implement partitioning of covered index to see partition filter in the plan, looking for suggestions #147

Comments

SynapsePOC commented Sep 5, 2020 • edited

Describe the issue

SynapsePOC commented Sep 5, 2020

apoorvedave1 commented Sep 9, 2020

SynapsePOC commented Sep 5, 2020 •

edited