Poor performance when running a large statement. sqlalchemy + oracledb #172

kryvokhyzha · 2023-04-19T12:52:37Z

Hi,
I have the following code:

import pandas as pd
from sqlalchemy.orm.session import Session


def compute(engine: Session) -> pd.DataFrame:
        # some steps
        # ....
        df = engine.query(...)

        with engine.get_bind().connect() as conn:
                df = pd.read_sql(sql=df.statement, con=conn)

        return df

If I compile SQL code from the df variable and then run it directly in DBeaver - this statement will complete in 17-20 seconds. And it's okay. Compiled SQL contains 347k characters.
If I run the calculation through pd.read_sql, then this code will run for about 16 minutes. I noticed that the delay happens in the _prepare method (see trace below).

File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/site-packages/oracledb/cursor.py", line 137, in _prepare
    self._impl.prepare(statement, tag, cache_statement)
File "src/oracledb/impl/thin/cursor.pyx", line 213, in oracledb.thin_impl.ThinCursorImpl.prepare
File "src/oracledb/impl/thin/connection.pyx", line 235, in oracledb.thin_impl.ThinConnImpl._get_statement
File "src/oracledb/impl/thin/connection.pyx", line 239, in oracledb.thin_impl.ThinConnImpl._get_statement
File "src/oracledb/impl/thin/statement.pyx", line 184, in oracledb.thin_impl.Statement._prepare
File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/re.py", line 209, in sub
  return _compile(pattern, flags).sub(repl, string, count)

At the same time, I can run simpler scripts via pd.read_sql and they execute normally.
Also, I build the same script like in df variable, but for PostgreSQL. And it run for about 35 seconds.

Issue

Is this expected behavior when working with large scripts?
Is there any way I can bypass this prepare step?

requirements.txt

# Apple M1 macOS Ventura 13.0
# python 3.10
SQLAlchemy==2.0.9
numpy==1.23.5
pandas==1.5.2
oracledb==1.3.0

Thank you!

The text was updated successfully, but these errors were encountered:

cjbj · 2023-04-19T21:32:02Z

Triple check you are connecting to the same DB in both scenarios!

Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.

I don't know what has changed in Pandas 2, but it may be worth checking that version too.

kryvokhyzha · 2023-04-20T12:17:20Z

Triple check you are connecting to the same DB in both scenarios!

Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.

I don't know what has changed in Pandas 2, but it may be worth checking that version too.

Directly launched script in DBeaver and through sqlalchemy+oracledb are executed on one DB and from one user, I checked it
The results of the scripts when running directly and via `sqlalchemy+oracledb' are the same. Only the execution time differs
Also, these results coincide with a similar script on postgres, which is also generated via sqlalchemy
Unfortunately, I cannot test in think mode, because my PC does not have an oracle client
Tested my code with pandas==2.0.0 - nothing changed
I decided to check whether the problem is really in the regular expressions, which placed in the oracledb.thin_impl.Statement._prepare method (File "src/oracledb/impl/thin/statement.pyx"). Got the following results:

import re
import time


sql = """
-- large sql script
"""

print(len(sql))  # 351449

start = time.time()
sql = re.sub(r"/\*[\S\n ]+?\*/", "", sql)
print(time.time() - start)  # 0.0004279613494873047

start = time.time()
sql = re.sub(r"\--.*(\n|$)", "", sql)
print(time.time() - start)  # 0.0004191398620605469

start = time.time()
sql = re.sub(r"""'[^']*'(?=(?:[^']*[^']*')*[^']*$)*""", "", sql,
             flags=re.MULTILINE)
print(time.time() - start)  # 3571.5018050670624

start = time.time()
sql = re.sub(r'(:\s*)?("([^"]*)")',
            lambda m: m.group(0) if sql[m.start(0)] == ":" else "",
            sql)
print(time.time() - start)  # 0.007905006408691406

So, the problem is precisely in the third regular expression, which is executed for a very long time

anthony-tuininga · 2023-04-20T13:35:56Z

That's helpful. Are you able to share the large SQL script? You can e-mail it to me if you prefer (anthony.tuininga@gmail.com). Some adjustments to the regular expressions are planned to avoid these problems!

that perform better (#172).

anthony-tuininga · 2023-04-20T22:48:03Z

I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

kryvokhyzha · 2023-04-21T08:16:45Z

I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

I have tested a patch that you pushed. Now code works really faster.
Thank you!

anthony-tuininga · 2023-04-24T20:45:25Z

This has been included in version 1.3.1 which was just released!

kryvokhyzha added the question Further information is requested label Apr 19, 2023

anthony-tuininga added a commit that referenced this issue Apr 20, 2023

Replaced regular expressions for parsing SQL with regular expressions

30ec927

that perform better (#172).

anthony-tuininga added the patch available label Apr 20, 2023

kryvokhyzha closed this as completed Apr 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performance when running a large statement. sqlalchemy + oracledb #172

Poor performance when running a large statement. sqlalchemy + oracledb #172

kryvokhyzha commented Apr 19, 2023 •

edited

Loading

cjbj commented Apr 19, 2023

kryvokhyzha commented Apr 20, 2023 •

edited

Loading

anthony-tuininga commented Apr 20, 2023

anthony-tuininga commented Apr 20, 2023

kryvokhyzha commented Apr 21, 2023

anthony-tuininga commented Apr 24, 2023

Poor performance when running a large statement. sqlalchemy + oracledb #172

Poor performance when running a large statement. sqlalchemy + oracledb #172

Comments

kryvokhyzha commented Apr 19, 2023 • edited Loading

Issue

requirements.txt

cjbj commented Apr 19, 2023

kryvokhyzha commented Apr 20, 2023 • edited Loading

anthony-tuininga commented Apr 20, 2023

anthony-tuininga commented Apr 20, 2023

kryvokhyzha commented Apr 21, 2023

anthony-tuininga commented Apr 24, 2023

kryvokhyzha commented Apr 19, 2023 •

edited

Loading

kryvokhyzha commented Apr 20, 2023 •

edited

Loading