Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance when running a large statement. sqlalchemy + oracledb #172

Closed
kryvokhyzha opened this issue Apr 19, 2023 · 6 comments
Closed
Labels
patch available question Further information is requested

Comments

@kryvokhyzha
Copy link

kryvokhyzha commented Apr 19, 2023

Hi,
I have the following code:

import pandas as pd
from sqlalchemy.orm.session import Session


def compute(engine: Session) -> pd.DataFrame:
        # some steps
        # ....
        df = engine.query(...)

        with engine.get_bind().connect() as conn:
                df = pd.read_sql(sql=df.statement, con=conn)

        return df
  1. If I compile SQL code from the df variable and then run it directly in DBeaver - this statement will complete in 17-20 seconds. And it's okay. Compiled SQL contains 347k characters.
  2. If I run the calculation through pd.read_sql, then this code will run for about 16 minutes. I noticed that the delay happens in the _prepare method (see trace below).
File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/site-packages/oracledb/cursor.py", line 137, in _prepare
    self._impl.prepare(statement, tag, cache_statement)
File "src/oracledb/impl/thin/cursor.pyx", line 213, in oracledb.thin_impl.ThinCursorImpl.prepare
File "src/oracledb/impl/thin/connection.pyx", line 235, in oracledb.thin_impl.ThinConnImpl._get_statement
File "src/oracledb/impl/thin/connection.pyx", line 239, in oracledb.thin_impl.ThinConnImpl._get_statement
File "src/oracledb/impl/thin/statement.pyx", line 184, in oracledb.thin_impl.Statement._prepare
File "/opt/homebrew/Caskroom/miniforge/base/envs/venv/lib/python3.10/re.py", line 209, in sub
  return _compile(pattern, flags).sub(repl, string, count)
  1. At the same time, I can run simpler scripts via pd.read_sql and they execute normally.
  2. Also, I build the same script like in df variable, but for PostgreSQL. And it run for about 35 seconds.

Issue

  1. Is this expected behavior when working with large scripts?
  2. Is there any way I can bypass this prepare step?

requirements.txt

# Apple M1 macOS Ventura 13.0
# python 3.10
SQLAlchemy==2.0.9
numpy==1.23.5
pandas==1.5.2
oracledb==1.3.0

Thank you!

@kryvokhyzha kryvokhyzha added the question Further information is requested label Apr 19, 2023
@cjbj
Copy link
Member

cjbj commented Apr 19, 2023

Triple check you are connecting to the same DB in both scenarios!

Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.

I don't know what has changed in Pandas 2, but it may be worth checking that version too.

@kryvokhyzha
Copy link
Author

kryvokhyzha commented Apr 20, 2023

Triple check you are connecting to the same DB in both scenarios!

Can you add a call to init_oracle_client() (to run in Thick mode), set the environment variable export DPI_DEBUG_LEVEL=16 (see here) before starting, and then check that the generated SQL statement being run is the same as you are running directly? Make sure you are using the same bind variables, if any, in both scenarios.

I don't know what has changed in Pandas 2, but it may be worth checking that version too.

  1. Directly launched script in DBeaver and through sqlalchemy+oracledb are executed on one DB and from one user, I checked it
  2. The results of the scripts when running directly and via `sqlalchemy+oracledb' are the same. Only the execution time differs
  3. Also, these results coincide with a similar script on postgres, which is also generated via sqlalchemy
  4. Unfortunately, I cannot test in think mode, because my PC does not have an oracle client
  5. Tested my code with pandas==2.0.0 - nothing changed
  6. I decided to check whether the problem is really in the regular expressions, which placed in the oracledb.thin_impl.Statement._prepare method (File "src/oracledb/impl/thin/statement.pyx"). Got the following results:
import re
import time


sql = """
-- large sql script
"""

print(len(sql))  # 351449

start = time.time()
sql = re.sub(r"/\*[\S\n ]+?\*/", "", sql)
print(time.time() - start)  # 0.0004279613494873047

start = time.time()
sql = re.sub(r"\--.*(\n|$)", "", sql)
print(time.time() - start)  # 0.0004191398620605469

start = time.time()
sql = re.sub(r"""'[^']*'(?=(?:[^']*[^']*')*[^']*$)*""", "", sql,
             flags=re.MULTILINE)
print(time.time() - start)  # 3571.5018050670624

start = time.time()
sql = re.sub(r'(:\s*)?("([^"]*)")',
            lambda m: m.group(0) if sql[m.start(0)] == ":" else "",
            sql)
print(time.time() - start)  # 0.007905006408691406
  1. So, the problem is precisely in the third regular expression, which is executed for a very long time

@anthony-tuininga
Copy link
Member

That's helpful. Are you able to share the large SQL script? You can e-mail it to me if you prefer (anthony.tuininga@gmail.com). Some adjustments to the regular expressions are planned to avoid these problems!

@anthony-tuininga
Copy link
Member

I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

@kryvokhyzha
Copy link
Author

I have pushed a patch that should correct this issue. If you are able to build from source you can verify that it corrects your issue as well.

I have tested a patch that you pushed. Now code works really faster.
Thank you!

@anthony-tuininga
Copy link
Member

This has been included in version 1.3.1 which was just released!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
patch available question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants