Query optimization for LazyTbl join #346

ivdim1999 · 2021-10-14T08:52:01Z

I have been experimenting with various verbs and operations on LazyTbl. When doing joins it caught my eye that the joined tables themselves enter as subqueries that select everything from them.

My code:

c = siuba.sql.LazyTbl(engine, "contract")
ct = siuba.sql.LazyTbl(engine, "contract_type")
c >> left_join(t, ct, on={'type': 'type'}) >> select(t.section, t.code, t.value1) >> show_query()

The SQL produced:

SELECT anon_1.section, 
       anon_1.code, 
       anon_2.value1 
FROM   (SELECT contract.section    AS section, 
               contract.code       AS code, 
               contract.start_date AS start_date, 
               contract.end_date   AS end_date, 
               contract.type       AS type 
        FROM   contract) AS anon_1 
       LEFT OUTER JOIN (SELECT contract_type.type   AS type, 
                               contract_type.value1 AS value1, 
                               contract_type.value2 AS value2, 
                               contract_type.value3 AS value3 
                        FROM   contract_type) AS anon_2 
                    ON anon_1.type = anon_2.type

I imagined initially that I'll get something like:

SELECT contract.section, 
       contract.code, 
       contract_type.value1 
FROM   contract 
       LEFT OUTER JOIN contract_type 
                    ON contract.type = contract_type.type

The subqueries in the first sql can be omitted but siuba doesn't realize it. This prompted me to look at how join is implemented in siuba/sql/verbs.py for lazy tables. The two instances that seem to be joined in the end are:

# Needs to be on the table, not the select
left_sel = left.last_op.alias()
right_sel = right.last_op.alias()

So I also looked at the implementation of LazyTbl in siuba/sql/verbs.py. Here the default operation is select in the list self.ops = [self.tbl.select()] if ops is None else ops. Which leads me to believe that by default the join uses this method with the subqueries that select everything from the joined instances and it's also probably not just an issue with the translation in show_query().

Is this problem circumventable in some way right now by using a workaround of some sorts (as in most cases having those subqueries is completely unnecessary)?

The text was updated successfully, but these errors were encountered:

machow · 2021-10-14T20:56:46Z

Hey! There are some places where siuba knows to not create an extra subquery (for example, certain forms of mutate), but in other places the exact construction of raw subqueries hasn't been optimized. I think in most cases it shouldn't affect performance, as query optimizers are pretty good at this stuff!

But there is definite room for improvement. I have a refactor on the stable branch that I think improved some aspects of composing queries.

I have tried to code rules for simplifying in a few places, like the col_expression_requires_cte function. But as I recall replacing the initial select with just the table (or more likely pulling the table out of a simple select in a join) should work.

Would you be open to branching off stable and taking a pass? Happy to review--these kinds of simplifications to the generated SQL are really useful!

ivdim1999 · 2021-10-14T21:19:05Z

I'm happy to give it a go. Also I agree that most query optimizers shouldn't have issues with subqueries sql but it's less aesthetic that way, especially with many nested joins as is the case for some queries that I run. Additionally, is there an option to set table aliases as currently they're defaulted to anon_<something>, which could be confusing sometimes. If not, I could try to take a look into that as well.

machow · 2021-10-14T21:48:28Z

I experimented with radically simplifying certain queries here (e.g. using * when appropriate), if you want to try it out! It uses the show_query(simplify=True) argument!

#344

I think the name of anonymous tables is set by sqlalchemy, so wonder if it's possible to set there?

If you hit any snags don't hesitate to let me know. I'm happy to clarify anything!

ivdim1999 · 2021-10-15T08:05:25Z

Thanks, I'll look into how sqlalchemy sets the default aliases then.

ivdim1999 · 2021-10-15T12:28:36Z

How do you feel about having an extra parameter in the join function for example that would support aliasing of tables. It could be set to None by default, which would prompt the default sqlalchemy.alias(), but the user could provide 2 aliases that could be used as well. It seems pretty easy to include and I don't think it will mess with anything else.

ivdim1999 · 2021-10-15T12:36:07Z

Although I guess if it were to be added there, it would need to be added everywhere for consistency.

machow · 2021-10-15T13:23:42Z

Are you saying a parameter to set the name (or prefix for the name) of an aliased table? SQLAlchemy handles the building of a query, and the compiling of it to text. I suspect setting the name of anonymous tables might happen when compiling.

(See this line of the bigquery dialect for an example: https://github.com/googleapis/python-bigquery-sqlalchemy/blob/main/sqlalchemy_bigquery/base.py#L181)

machow · 2021-10-15T13:27:38Z

Ah, I'm seeing the argument to set alias name you mentioned in the method. I would hold off on adding an argument for the alias name to all the functions. The verbs are focused on doing the analysis, but methods like show_query(), or a SQLAlchemy visitor could always change alias names if useful!

ivdim1999 · 2021-10-15T13:30:20Z

Ok, it would seem to be a slight hassle to add that to all the verbs as well as probably a questionable design choice (as you pointed out it's better to keep their job focused on doing the analysis). I'll take a look into show_query() in that regard then.

machow · 2022-11-16T18:17:19Z

This show_query should now be much more user friendly, since it no longer wraps each table in an initial select:

from siuba.data import cars_sql
from siuba import *

cars_sql >> inner_join(_, cars_sql, "cyl") >> select(_.cyl, _.mpg_x) >> show_query()

output:

SELECT cars_1.cyl, cars_1.mpg AS mpg_x
FROM cars AS cars_1 JOIN cars AS cars_2 ON cars_1.cyl = cars_2.cyl

ivdim1999 mentioned this issue Oct 15, 2021

[WIP] Query optimization for JOINs to not default to SELECT subqueries (for LazyTbl) #347

Closed

machow added the type:feature New feature or request label Jan 5, 2022

machow mentioned this issue Sep 28, 2022

feat(sql): add LazyTbl.last_select to simplify queries #449

Merged

machow closed this as completed Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query optimization for LazyTbl join #346

Query optimization for LazyTbl join #346

ivdim1999 commented Oct 14, 2021

machow commented Oct 14, 2021

ivdim1999 commented Oct 14, 2021

machow commented Oct 14, 2021 •

edited

ivdim1999 commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

machow commented Oct 15, 2021

machow commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

machow commented Nov 16, 2022 •

edited

Query optimization for LazyTbl join #346

Query optimization for LazyTbl join #346

Comments

ivdim1999 commented Oct 14, 2021

machow commented Oct 14, 2021

ivdim1999 commented Oct 14, 2021

machow commented Oct 14, 2021 • edited

ivdim1999 commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

machow commented Oct 15, 2021

machow commented Oct 15, 2021

ivdim1999 commented Oct 15, 2021

machow commented Nov 16, 2022 • edited

machow commented Oct 14, 2021 •

edited

machow commented Nov 16, 2022 •

edited