Currently, predicates on indexed columns are not pushed down to the connector level. We should provide an API to push them down to the connector. Hive and Pig have an indexed-column predicate push down feature that pushes the query down to use connector-level indexes, e.g. Cassandra's secondary indexes on columns.
Do you need access to the whole spectrum of possible predicates on indexed columns, or only range predicates? Currently we pass range predicate information to the connectors when generating partitions, and thereby when generating splits from those partitions.
For example, take this Cassandra table:
CREATE TABLE test(key_id int primary key, b int );
CREATE INDEX index_b on test(b);
Presto query: select * from test where b = 100;
Currently Presto retrieves all partitions based on the primary key key_id. The following is the final query pushed down to the Cassandra connector:
select * from test where token(key_id) > [start_token] and token(key_id) < [end_token]
It then filters the results by b = 100.
If we can push the indexed column predicate down to Cassandra, we only need to select from each partition with the following query:
select * from test where b = 100 and token(key_id) > [start_token] and token(key_id) < [end_token]
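To make the difference between the two queries concrete, here is a minimal Java sketch of how a per-split CQL string could be assembled with an optional equality predicate on an indexed column. The class SplitQueryBuilder and the method buildSplitCql are hypothetical names made up for illustration; they are not part of Presto or the Cassandra driver, and the sketch only handles integer equality predicates.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SplitQueryBuilder {
    // Hypothetical helper (not a Presto or Cassandra API): builds the CQL for one
    // token-range split, prepending any equality predicates on indexed columns.
    static String buildSplitCql(String table, String partitionKey,
                                long startToken, long endToken,
                                Map<String, Integer> indexedEquals) {
        StringJoiner where = new StringJoiner(" AND ");
        // Pushed-down predicates, e.g. b = 100.
        for (Map.Entry<String, Integer> e : indexedEquals.entrySet()) {
            where.add(e.getKey() + " = " + e.getValue());
        }
        // Token range that bounds this split.
        where.add("token(" + partitionKey + ") > " + startToken);
        where.add("token(" + partitionKey + ") < " + endToken);
        return "SELECT * FROM " + table + " WHERE " + where;
    }

    public static void main(String[] args) {
        Map<String, Integer> pushed = new LinkedHashMap<>();
        pushed.put("b", 100);
        // With the predicate pushed down.
        System.out.println(buildSplitCql("test", "key_id", -100L, 100L, pushed));
        // Without push down: only the token range is sent to Cassandra.
        System.out.println(buildSplitCql("test", "key_id", -100L, 100L, new LinkedHashMap<>()));
    }
}
```

With an empty map the builder produces the token-range-only query that Presto issues today, so the pushed-down predicate is purely additive.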
Can you elaborate on how range predicate push down works?
I believe the range predicate push down already does this for you, but it may have to be supported properly in the Cassandra connector. If you go to the ConnectorSplitManager class, you will see that when getPartitions is called, a TupleDomain is provided that defines all of the range predicates that Presto was able to extract or infer from the query syntax. At this point the Cassandra connector should be able to generate partitions that are aware of these ranges, and then generate only splits that respect these predicates (so as not to produce unnecessary data).
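As a rough illustration of the idea, the sketch below shows how a connector might decide which of the extracted predicates it can serve from an index. It deliberately does not use the real TupleDomain API: a plain Map of column name to required value stands in for the constraints, and PushdownPlanner and selectPushable are hypothetical names. The real object can express ranges and more, so treat this as a simplification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PushdownPlanner {
    // Stand-in for the constraints Presto hands to the connector. Only single-value
    // equality constraints are modeled here; this is an assumption of the sketch.
    static Map<String, Integer> selectPushable(Map<String, Integer> equalityConstraints,
                                               Set<String> indexedColumns) {
        Map<String, Integer> pushable = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : equalityConstraints.entrySet()) {
            // Only columns backed by a secondary index can be filtered in Cassandra;
            // the rest stay with Presto, which filters the returned rows itself.
            if (indexedColumns.contains(e.getKey())) {
                pushable.put(e.getKey(), e.getValue());
            }
        }
        return pushable;
    }

    public static void main(String[] args) {
        Map<String, Integer> constraints = new LinkedHashMap<>();
        constraints.put("b", 100);
        // Only b is indexed (index_b in the example table), so only b is pushable.
        System.out.println(selectPushable(constraints, Set.of("b")));
    }
}
```

Anything the connector cannot push down is still safe, because Presto applies the original predicate to whatever rows come back.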
Is anyone working on a MySQL connector, or does anyone know of someone working on one? Is there any information on building connectors?
Ahh, I will check out the TupleDomain object to update the Cassandra connector.
Looks like this will fix #1033.