Skip to content
This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

Optimize join ordering in N-table joins where N > 2 #41

Open
arthurhsu opened this issue May 12, 2015 · 3 comments
Open

Optimize join ordering in N-table joins where N > 2 #41

arthurhsu opened this issue May 12, 2015 · 3 comments

Comments

@arthurhsu
Copy link
Contributor

Join ordering is a pretty heavy concept in query optimization. The available approaches can get very involved and maybe outside the scope of Lovefield.

As an example of different levels of addressing the issue consider the case of a 3 vs 4 table join.
A 3 table join graph has always the same chain structure as A -> B-> C, where edges in this graph represent a join condition between (leftTable, rightTable). On the other hand, In a 4 table join case the graph can have different structure (chain, cycle, star, clique etc), and therefore optimizing such graphs is a harder problem.

Need to find the extend to which Lovefield should address the join ordering optimization issue. There are simplifications that can be made, for example considering left-deep only trees.

At the very least, for the 3table join case, there should be no unnecessary cross-product operations, which with our current approach is not guaranteed (depends on the order of the tables as supplied in the query).

@agershun
Copy link

I think you can start with of cost function which depends on the number of estimated records in each table and the sequence of these table (and may be other parameters). One approach - simply model it with multiple samples runs (with different order of joined tables, and then process it with neural network or any other machine learning methods to esimate this function coefficients.

I suppose that this function will be different for Lovefield and other databases, because the cost of joining is different for different engines. If you would like we can compare database engine coefficients on the different sets of data.

The problem is interesting, I will try to model this task and send you results.

@agershun
Copy link

BTW You are talking about INNER JOINs only, right? Because other JOINs are non-commutative, AFAIK.

@agershun
Copy link

I tried to model the situation with four joined tables and compare results in direct and reversed order.
Please, see this test file

    SELECT * 
      FROM one
      INNER JOIN two ON one.b = two.b
      INNER JOIN three ON two.c = three.c
      INNER JOIN four ON three.d = four.d;

   SELECT * 
      FROM four 
      INNER JOIN three ON three.d = four.d
      INNER JOIN two ON two.c = three.c
      INNER JOIN one ON one.b = two.b;

The hypothesis was: direct order is faster if number of records in first two tables more than next two records. The probability of positive test was about 60%. It does not worth for special optimization (of course, for these paticular kind of joins).

Of course, there are many factors, which affects on the result, and preindexation is the first one.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants