bug: regression: slow to generate SQL, generated SQL is much slower #8484
I can get around this regression by throwing in more intermediate .cache()s, but
The code I have that generates that expression isn't shareable for a repro, but I think it would be valuable to
I can help with task 1, but 2 and 3 should probably be done by a maintainer.
Do you have a reproducer, perhaps? If not, could you post the
Sorry, I posted more info in a follow-up comment. I don't have a repro on hand, so I will have to make one. Since I have to do that, do you have any guidelines for it? E.g., as I'm trying to choose a scale for it, should it take ~0.1 seconds to compile on 8.0.0?
on
on 8.0.0:
If you can post the timings for each of the different versions, that would be helpful.
@cpcloud what timings do you mean? In the original post I have the timings for compiling the SQL and for .cache()ing it. The user code that generates the expressions is the same between 8.0.0 and main.
Ah, I see the timings now in a comment. ~1 sec for 8.0
Ah, just to be clear: I started with
@NickCrews could you try to call |
I suspect the CTE generation, performance of generating SQL and the execution performance are all related. If we're repeating the same subquery multiple times, the database may not automatically extract them into CTEs. We've already seen this with SQLite. |
Let me actually craft some user code that can repro this. Then it will be much easier for you to understand and tweak, and it can get turned into a benchmark. |
OK, here is a notebook that can repro the issue, and some test data of names (from the Alaska Division of Elections voterfile, this is public data) |
It looks like y'all are doing some work on this. I really appreciate that, this looks tricky. I'm curious how y'all are thinking about the goal here. My thoughts would be that we should be aiming to generate SQL that is
How are y'all thinking about this? This should probably get written down somewhere in the docs as the official stance of Ibis; this feels important, and it will impact all sorts of other decisions. I.e., if "as fast as possible to execute" isn't the number one priority, some people might be unwilling to use Ibis and might just hand-roll their own SQL.
I think you've got the priorities in the correct order. In this particular case, my hunch is that the first two priorities are related: by failing to recursively extract CTEs, we duplicate subqueries, which leads to more code to generate (and thus slower generation of said code), and engines that do not perform common subexpression elimination are doomed to execute identical queries more than once.
Trino is the only backend I'm aware of that executes CTEs once for each reference (i.e., equivalent to repeating the subquery).
Agree that we should write this down somewhere, perhaps in one of the
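The duplication-versus-CTE tradeoff described above can be made concrete with a toy compiler. This is a pure-Python sketch (hypothetical `inline_sql`/`cte_sql` helpers over tuple trees, not Ibis's actual implementation): when each step of a query references the previous step twice, inlining doubles the generated text at every level, while hoisting each distinct subtree into a CTE keeps growth linear.

```python
def inline_sql(node):
    """Render an ('op', child, ...) tuple tree, inlining every subtree."""
    if isinstance(node, str):
        return node
    op, *children = node
    return f"{op}({', '.join(inline_sql(c) for c in children)})"

def cte_sql(node):
    """Render the same tree, hoisting each distinct subtree into a CTE."""
    ctes = {}  # subtree -> (CTE name, rendered body), insertion-ordered

    def walk(n):
        if isinstance(n, str):
            return n
        if n not in ctes:
            op, *children = n
            body = f"{op}({', '.join(walk(c) for c in children)})"
            ctes[n] = (f"t{len(ctes)}", body)
        return ctes[n][0]

    top = walk(node)
    with_clause = ", ".join(f"{name} AS ({body})" for name, body in ctes.values())
    return f"WITH {with_clause} SELECT * FROM {top}"

# A chain where each step references the previous step twice: the inlined
# form doubles in size at every level; the CTE form grows linearly.
expr = "base"
for _ in range(10):
    expr = ("join", expr, expr)

print(len(inline_sql(expr)))  # exponential in chain depth
print(len(cte_sql(expr)))     # linear: one CTE per unique subtree
```

This is only an illustration of the size blowup, not of Ibis's rewrite rules; the real compiler operates on relational operations, not strings.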
Thanks Nick for sharing; I am able to reproduce. Currently inspecting what happens: the ibis IR and the rewrite just before the sqlglot translation are instant, so it is something related to sqlglot object creation. The memory also explodes, so this is an important issue to fix. Just throwing the profiling result here:

```
33.892 <module> slow.py:1
└─ 33.892 to_sql ibis/expr/sql.py:342
   ├─ 17.801 transpile sqlglot/__init__.py:133
   │     [1558 frames hidden] sqlglot
   └─ 16.077 Backend._to_sql ibis/backends/sql/__init__.py:124
      └─ 16.076 Backend.compile ibis/backends/sql/__init__.py:115
         ├─ 10.697 Backend._to_sqlglot ibis/backends/duckdb/__init__.py:106
         │  └─ 10.697 Backend._to_sqlglot ibis/backends/sql/__init__.py:92
         │     └─ 10.697 DuckDBCompiler.translate ibis/backends/sql/compiler.py:422
         │        └─ 10.628 Select.map ibis/common/graph.py:232
         │           └─ 10.622 fn ibis/backends/sql/compiler.py:455
         │              └─ 10.621 DuckDBCompiler.visit_node ibis/backends/sql/compiler.py:494
         │                 ├─ 5.161 DuckDBCompiler.impl ibis/backends/sql/compiler.py:350
         │                 │  └─ 5.160 <lambda> ibis/backends/sql/compiler.py:97
         │                 │     └─ 5.160 func sqlglot/expressions.py:6874
         │                 │        [162 frames hidden] sqlglot, copy
         │                 ├─ 4.496 DuckDBCompiler.visit_Select ibis/backends/sql/compiler.py:1101
         │                 │  ├─ 2.340 Select.from_ sqlglot/expressions.py:2812
         │                 │  │     [201 frames hidden] sqlglot, copy
         │                 │  └─ 2.156 DuckDBCompiler._dedup_name ibis/backends/sql/compiler.py:1074
         │                 │     └─ 2.156 If.as_ sqlglot/expressions.py:735
         │                 │        [239 frames hidden] sqlglot, copy
         │                 └─ 0.380 DuckDBCompiler.visit_Coalesce ibis/backends/sql/compiler.py:956
         │                    └─ 0.380 <lambda> ibis/backends/sql/compiler.py:97
         │                       └─ 0.380 func sqlglot/expressions.py:6874
         │                          [37 frames hidden] sqlglot, copy
         └─ 5.379 Select.sql sqlglot/expressions.py:513
               [276 frames hidden] sqlglot, copy
```
My guess is that every time we construct a sqlglot expression we create a full copy of the previous sqlglot expression, hence the memory explosion. The sqlglot functions and methods have a
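The cost of copy-on-every-call builders can be sketched with a minimal pure-Python analogue. This is a hypothetical `Select` class, not sqlglot itself (sqlglot's builder methods take a `copy` keyword in the same spirit): with copying on, building an n-column query deep-copies trees of size 0, 1, ..., n-1, i.e. O(n^2) total work plus n short-lived full copies, which matches the "every construction copies the previous expression" guess.

```python
import copy

# Toy analogue of a builder API where every call deep-copies the whole
# tree unless told not to (hypothetical class, not sqlglot itself).

class Select:
    def __init__(self):
        self.projections = []

    def select(self, column, copy_tree=True):
        # copy_tree=True: return a modified deep copy, leaving self untouched.
        # copy_tree=False: mutate self in place and return it (no allocation).
        node = copy.deepcopy(self) if copy_tree else self
        node.projections.append(column)
        return node

# Copy-on-every-call: O(n^2) work overall for an O(n) query.
q = Select()
for i in range(100):
    q = q.select(f"col{i}")

# In-place mutation: the same build is O(n).
q2 = Select()
for i in range(100):
    q2 = q2.select(f"col{i}", copy_tree=False)

assert q.projections == q2.projections
assert q2.select("extra", copy_tree=False) is q2  # mutated in place
```

The immutable-by-default style is safer for callers who reuse subexpressions, which is presumably why it is the default; the in-place mode trades that safety for speed inside a compiler that owns all the intermediate objects.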
agree on a concepts article -- perhaps replacing the current one on internals: https://ibis-project.org/concepts/internals |
…reate a sqlglot object (#8592) Still profiling and adding more `copy=False` options; apparently it greatly improves the performance. According to profiling, the recursive generation of sqlglot expressions is still a bottleneck for big queries, which cannot be fixed on the ibis side. One option, though, would be to compile the fragments in a greedy fashion (they are going to be cached by `Node.map()`) and inject those as arbitrary strings into other sqlglot expressions. Now it uses 30% of the memory it used before. Apparently there are still a lot of sqlglot Literal objects around; looking into that. fixes #8484 --------- Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Can we re-open this? This is currently the number one blocker for me; my app has basically become unusable due to this. Is there any temporary workaround I can apply on my ibis fork that will make CTEs get extracted in the short term?
@NickCrews the real bottleneck here is sqlglot. We now avoid excessive sqlglot deep-copying, as well as generating non-pretty SQL by default, which made the overall compilation much better. I can think of a single additional optimization: eagerly compile the ibis expression to SQL strings and use those strings in other sqlglot expressions, ensuring that the same expression is compiled only once. I would suggest raising an issue upstream in sqlglot. In the meantime we may get more information about better usage of, or possible optimizations for, sqlglot.
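The "compile each expression only once" idea might look something like this toy sketch (pure Python over tuple trees, hypothetical `compile_once` helper, not Ibis's implementation): each DAG node is rendered to a SQL string exactly once, memoized by object identity, and the cached strings are spliced into parents instead of rebuilding (and re-copying) subtrees.

```python
def compile_once(node, memo):
    """Render an ('op', *children) DAG node, caching each node's SQL string
    by object identity so shared subtrees are compiled only once."""
    if isinstance(node, str):
        return node
    key = id(node)
    if key not in memo:
        op, *children = node
        memo[key] = f"{op}({', '.join(compile_once(c, memo) for c in children)})"
    return memo[key]

# `shared` is referenced twice but rendered once; its cached string is
# injected into both parents.
shared = ("scan", "penguins")
expr = ("join", ("filter", shared), ("agg", shared))

memo = {}
print(compile_once(expr, memo))  # join(filter(scan(penguins)), agg(scan(penguins)))
assert len(memo) == 4  # join, filter, agg, scan: each compiled exactly once
```

Note that the shared subtree's text still appears twice in the output; this avoids repeated compilation work, not repeated execution, which is what CTE extraction would address.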
Gotcha. If I understand correctly, even if we do that, there still won't be CTEs extracted. I'd guess, though I haven't benchmarked, that the actual execution was the cause of my slowdown, since we moved from having a lot of CTEs to no CTEs. Did you investigate this when you were benchmarking above, or should I? If we find that execution is the bottleneck, would you consider going back to the CTE-heavy format?
Well, we do extract CTEs for unique selections occurring more than once; see the implementation in `ibis/backends/sql/rewrites.py`, lines 195 to 207 at 0aceefc.
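The rule described (hoist only subtrees that are referenced more than once; inline everything else) can be sketched in pure Python. This is a toy illustration over tuple trees with hypothetical helper names, not the actual rewrite in `rewrites.py`:

```python
from collections import Counter

def count_refs(node, counts):
    """Count how many times each tuple subtree is referenced."""
    if isinstance(node, tuple):
        counts[node] += 1
        if counts[node] == 1:  # descend only on the first visit
            for child in node[1:]:
                count_refs(child, counts)

def render(node, counts, ctes):
    """Inline single-use subtrees; hoist multiply-referenced ones as CTEs."""
    if isinstance(node, str):
        return node
    if counts[node] > 1 and node in ctes:
        return ctes[node][0]          # reuse the already-emitted CTE name
    op, *children = node
    rendered = f"{op}({', '.join(render(c, counts, ctes) for c in children)})"
    if counts[node] > 1:              # shared: emit the CTE exactly once
        ctes[node] = (f"t{len(ctes)}", rendered)
        return ctes[node][0]
    return rendered

shared = ("filter", "penguins")
expr = ("union", ("agg", shared), ("sort", shared))

counts, ctes = Counter(), {}
count_refs(expr, counts)
body = render(expr, counts, ctes)
print(ctes)   # one CTE, for the shared ("filter", "penguins") subtree
print(body)   # union(agg(t0), sort(t0))
```

Single-use subtrees like the `agg` and `sort` steps stay inlined, which matches the behavior being discussed: only repetition triggers extraction.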
Thinking about a possible "extreme" solution here: what if we make every
I do this sort of thing a lot in my processing code:

```python
import ibis

t = ibis.examples.penguins.fetch()
t = t.mutate(species2=t.species.upper())
t = t.mutate(species3=t.species2.strip())
ibis.to_sql(t)
```

e.g. transform some expression, then transform that result again:

```sql
SELECT
  "t0"."species",
  "t0"."island",
  "t0"."bill_length_mm",
  "t0"."bill_depth_mm",
  "t0"."flipper_length_mm",
  "t0"."body_mass_g",
  "t0"."sex",
  "t0"."year",
  UPPER("t0"."species") AS "species2",
  TRIM(UPPER("t0"."species"), '…') AS "species3"
FROM "penguins" AS "t0"
```

Note how the `UPPER("t0"."species")` expression is duplicated. Note that this might not actually be a problem for execution; at least duckdb appears to be smart enough to only execute this once:

```python
import ibis
import pyarrow.compute as pc

i = 0

@ibis.udf.scalar.pyarrow
def plus1(x: int) -> int:
    global i
    i += 1
    return pc.add(x, 1)

t = ibis.examples.penguins.fetch().head(10)
t = t.mutate(body_mass_g2=plus1(t.body_mass_g))
t = t.mutate(body_mass_g3=t.body_mass_g2 * 2)
t.execute()
print(i)  # shows 1
```

But even if execution is smart enough, perhaps this is also the cause of the SQL generation slowdown, since I think this would be a
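One way to see why chains like the `mutate` example above could slow down SQL generation: if each derived column's definition is inlined into every later reference instead of referencing an intermediate relation, the text of step n contains the text of step n-1, so total SQL size grows quadratically in the chain length. A toy size comparison (pure Python, hypothetical function names, not what Ibis actually emits):

```python
def chained_inline(n):
    """Total SQL text if every derived column inlines its predecessor."""
    expr = '"t0"."species"'
    total = 0
    for i in range(n):
        expr = f"F{i}({expr})"   # stand-in for UPPER, TRIM, ...
        total += len(expr)       # each step's SELECT list repeats the chain
    return total

def chained_cte(n):
    """Total SQL text if each step only references the previous CTE."""
    return sum(len(f"F{i}(t{i}.col)") for i in range(n))

# Inlined size grows superlinearly (quadratic) with chain length;
# the CTE-style size grows linearly.
print([chained_inline(n) for n in (10, 20, 40)])
print([chained_cte(n) for n in (10, 20, 40)])
```

This is only a back-of-the-envelope model of output size; the actual compilation cost also includes the object-copying overhead discussed elsewhere in this thread.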
Sorry, I don't think I follow; I think we are suggesting the same thing? E.g.:

```sql
WITH t99 AS (
  SELECT
    "penguins"."species",
    UPPER("penguins"."species") AS "species2"
  FROM "penguins"
)
SELECT
  "t99"."species",
  "t99"."species2",
  TRIM("t99"."species2", '…') AS "species3"
FROM "t99"
```
OK, here is actually a problematic case where duckdb isn't smart enough:

```python
import ibis

a = ibis.array([1, 2, 3])

i = 0

@ibis.udf.scalar.python
def plus1(x: int) -> int:
    global i
    i += 1
    return x + 1

b = a.map(plus1).unique()
b.execute()
print(i)  # 24
```

```sql
SELECT
  CASE
    WHEN LIST_APPLY(
      [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
      __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
    ) IS NULL
    THEN NULL
    ELSE LIST_DISTINCT(
      LIST_APPLY(
        [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
        __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
      )
    ) + CASE
      WHEN LIST_COUNT(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
        )
      ) < LENGTH(
        LIST_APPLY(
          [CAST(1 AS TINYINT), CAST(2 AS TINYINT), CAST(3 AS TINYINT)],
          __ibis_param_x__ -> PLUS1_5(__ibis_param_x__)
        )
      )
      THEN [NULL]
      ELSE []
    END
  END AS "ArrayDistinct(ArrayMap(Array(), plus1_5(x)))"
```

I don't quite understand why it is 24; I would expect 3 * 4 = 12 times. Perhaps this existed before the sqlglot refactor and I just didn't notice. But it seems like the same sort of question of "under what circumstances should we extract CTEs?"
All bets are off with global mutable state. It's not the same problem, because the function is impure. There's no reasonable way we can write a function that would determine whether global state is mutated. If you mutate state like that, you're 100% at the mercy of the query engine, its calling convention, and probably its optimizer.
If I pass
Oh, I see; I think you are talking about the feasibility of doing the optimization on the ibis side. Hmm, that is trickier. What about non-UDFs: can we assume that all non-UDFs are pure, and thus that we can extract them?
Wait, I'm not sure that the purity of a function should decide our behavior here. If someone writes a UDF (either pure or impure), we don't promise anywhere in our docs how many times it will get called, and I don't think we should. Therefore, we have the freedom to call it as many times as we want, and I think the natural preference for us should be to call it as few times as possible.
The number of times a function gets called is definitely not guaranteed, especially since Ibis is effectively never the caller. The only thing that should be guaranteed is the correct result. I probably should have said something like: global state introduces dependencies on how something is done, and that will lead to invalid assumptions about function calls in
I think we're getting way into the weeds without a clear indication of what the problem is. Right now we know the following:
In theory, we can fix number 1 without fixing number 2 (which wouldn't be a useful thing to do, IMO). @kszucs We still need to address this issue, and it's still not clear how addressing sqlglot's SQL generation speed would help address the execution speed.
It may have been slow before; I may just not have noticed this problem. But regardless, I think this is still something we need to address.
Well, now I'm confused :) I thought you had something that was working (and reasonably fast) that is now unusable.
Ugh, sorry, yes, I'm being confusing. Indeed, if you try out the repro example I shared above (with a few steps commented out so it actually finishes on main), on 8.0.0:
on main:
The thing that is the same between 8.0.0 and main is that there are still some subexpressions that are repeated many times, when they possibly could/should be extracted into a single instance. But there are just many, many more repeats on main.
Resolved by #8825 |
What happened?
I have some very complicated ibis expression.
On 8.0.0:
and I get the sql
On main (b5e6373) I get
and I get the SQL (actually much longer, but GitHub won't let me post something that long); you can see how this uses nested expressions, not CTEs:
In actuality, I used to have an even more complex expression, but that literally took minutes to compile to SQL, and I had to ctrl-c before it finished. The .cache() also took longer than I had patience for. Then I threw some .cache()s into the middle of that chain to break it up, which gave me the expression I have here. If I throw more .cache()s into the chain, the difference between the timings shrinks further.
What version of ibis are you using?
b5e6373 vs 8.0.0
What backend(s) are you using, if any?
duckdb
Relevant log output
No response
Code of Conduct