Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query optimization for LazyTbl join #346

Closed
ivdim1999 opened this issue Oct 14, 2021 · 10 comments
Closed

Query optimization for LazyTbl join #346

ivdim1999 opened this issue Oct 14, 2021 · 10 comments
Labels
type:feature New feature or request

Comments

@ivdim1999
Copy link

I have been experimenting with various verbs and operations on LazyTbl. When doing joins it caught my eye that the joined tables themselves enter as subqueries that select everything from them.

My code:

c = siuba.sql.LazyTbl(engine, "contract")
ct = siuba.sql.LazyTbl(engine, "contract_type")
c >> left_join(t, ct, on={'type': 'type'}) >> select(t.section, t.code, t.value1) >> show_query()

The SQL produced:

SELECT anon_1.section, 
       anon_1.code, 
       anon_2.value1 
FROM   (SELECT contract.section    AS section, 
               contract.code       AS code, 
               contract.start_date AS start_date, 
               contract.end_date   AS end_date, 
               contract.type       AS type 
        FROM   contract) AS anon_1 
       LEFT OUTER JOIN (SELECT contract_type.type   AS type, 
                               contract_type.value1 AS value1, 
                               contract_type.value2 AS value2, 
                               contract_type.value3 AS value3 
                        FROM   contract_type) AS anon_2 
                    ON anon_1.type = anon_2.type 

I imagined initially that I'll get something like:

SELECT contract.section, 
       contract.code, 
       contract_type.value1 
FROM   contract 
       LEFT OUTER JOIN contract_type 
                    ON contract.type = contract_type.type 

The subqueries in the first sql can be omitted but siuba doesn't realize it. This prompted me to look at how join is implemented in siuba/sql/verbs.py for lazy tables. The two instances that seem to be joined in the end are:

# Needs to be on the table, not the select
left_sel = left.last_op.alias()
right_sel = right.last_op.alias()

So I also looked at the implementation of LazyTbl in siuba/sql/verbs.py. Here the default operation is select in the list self.ops = [self.tbl.select()] if ops is None else ops. Which leads me to believe that by default the join uses this method with the subqueries that select everything from the joined instances and it's also probably not just an issue with the translation in show_query().

Is this problem circumventable in some way right now by using a workaround of some sorts (as in most cases having those subqueries is completely unnecessary)?

@machow
Copy link
Owner

machow commented Oct 14, 2021

Hey! There are some places where siuba knows to not create an extra subquery (for example, certain forms of mutate), but in other places the exact construction of raw subqueries hasn't been optimized. I think in most cases it shouldn't affect performance, as query optimizers are pretty good at this stuff!

But there is definite room for improvement. I have a refactor on the stable branch that I think improved some aspects of composing queries.

I have tried to code rules for simplifying in a few places, like the col_expression_requires_cte function. But as I recall replacing the initial select with just the table (or more likely pulling the table out of a simple select in a join) should work.

Would you be open to branching off stable and taking a pass? Happy to review--these kinds of simplifications to the generated SQL are really useful!

@ivdim1999
Copy link
Author

I'm happy to give it a go. Also I agree that most query optimizers shouldn't have issues with subqueries sql but it's less aesthetic that way, especially with many nested joins as is the case for some queries that I run. Additionally, is there an option to set table aliases as currently they're defaulted to anon_<something>, which could be confusing sometimes. If not, I could try to take a look into that as well.

@machow
Copy link
Owner

machow commented Oct 14, 2021

I experimented with radically simplifying certain queries here (e.g. using * when appropriate), if you want to try it out! It uses the show_query(simplify=True) argument!

#344

I think the name of anonymous tables is set by sqlalchemy, so wonder if it's possible to set there?

If you hit any snags don't hesitate to let me know. I'm happy to clarify anything!

@ivdim1999
Copy link
Author

Thanks, I'll look into how sqlalchemy sets the default aliases then.

@ivdim1999
Copy link
Author

How do you feel about having an extra parameter in the join function for example that would support aliasing of tables. It could be set to None by default, which would prompt the default sqlalchemy.alias(), but the user could provide 2 aliases that could be used as well. It seems pretty easy to include and I don't think it will mess with anything else.

@ivdim1999
Copy link
Author

Although I guess if it were to be added there, it would need to be added everywhere for consistency.

@machow
Copy link
Owner

machow commented Oct 15, 2021

Are you saying a parameter to set the name (or prefix for the name) of an aliased table? SQLAlchemy handles the building of a query, and the compiling of it to text. I suspect setting the name of anonymous tables might happen when compiling.

(See this line of the bigquery dialect for an example: https://github.com/googleapis/python-bigquery-sqlalchemy/blob/main/sqlalchemy_bigquery/base.py#L181)

@machow
Copy link
Owner

machow commented Oct 15, 2021

Ah, I'm seeing the argument to set alias name you mentioned in the method. I would hold off on adding an argument for the alias name to all the functions. The verbs are focused on doing the analysis, but methods like show_query(), or a SQLAlchemy visitor could always change alias names if useful!

@ivdim1999
Copy link
Author

Ok, it would seem to be a slight hassle to add that to all the verbs as well as probably a questionable design choice (as you pointed out it's better to keep their job focused on doing the analysis). I'll take a look into show_query() in that regard then.

@machow
Copy link
Owner

machow commented Nov 16, 2022

This show_query should now be much more user friendly, since it no longer wraps each table in an initial select:

from siuba.data import cars_sql
from siuba import *

cars_sql >> inner_join(_, cars_sql, "cyl") >> select(_.cyl, _.mpg_x) >> show_query()

output:

SELECT cars_1.cyl, cars_1.mpg AS mpg_x
FROM cars AS cars_1 JOIN cars AS cars_2 ON cars_1.cyl = cars_2.cyl

@machow machow closed this as completed Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:feature New feature or request
Projects
Status: Done
Development

No branches or pull requests

2 participants