Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

Closed
RogerThomas opened this issue Apr 12, 2016 · 8 comments
Labels
IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance
Milestone

Comments

@RogerThomas
Copy link
Contributor

This is related to #7815

Since this fix, when checking for case sensitivity issues for MySQL using InnoDB engine with large numbers of tables, Class SQLDatabase.to_sql can take a long time (I currently have ~30,000 tables and this call takes ~10 seconds).
This is due to the 'SHOW FULL TABLES' call coming from SQLAlchemy, from these lines of code specifically

engine = self.connectable.engine
with self.connectable.connect() as conn:
    table_names = engine.table_names(
        schema=schema or self.meta.schema,
        connection=conn,
     )

A suggested fix would be to add a parameter to the to_sql function.
The parameter might look something like, case_check='warn'
and the fix might be something like

if case_check == 'raise' and not name.islower():
    raise Exception("table name must be lower case when case_check = 'raise'")
current_code()
if case_check == 'warn':
    engine = self.connectable.engine
    with self.connectable.connect() as conn:
        table_names = engine.table_names(
            schema=schema or self.meta.schema,
            connection=conn,
         )

Versions:

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 18.2
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: None
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.38.0

@jorisvandenbossche jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Apr 12, 2016
@jorisvandenbossche
Copy link
Member

Thanks for reporting this, that is indeed a problem.
I think your proposal for an extra keyword is certainly acceptable (although I would keep it simpler and just have True/False, without the 'raise' option). The only problem with this is that it is also not that discoverable (most people who are bitten by a performance problem because of this will not always think of using this keyword without first digging in to find the reason).

@RogerThomas
Copy link
Contributor Author

No problem, thanks for looking into it.
That's a good point, perhaps a parameter force_lower_case is True by default then something like

if force_lower_case and not name.islower():
    raise Exception("table name must be lower case when force_lower_case is true'")
current_code()
if not force_lower_case:
    engine = self.connectable.engine
    with self.connectable.connect() as conn:
        table_names = engine.table_names(
            schema=schema or self.meta.schema,
            connection=conn,
         )

@jorisvandenbossche
Copy link
Member

Also just thinking, what we can do in any case (no backwards compat issues), is only doing this check (and thus retrieving all tables names) when the provided table name is not all lowercase. That's already a rather cheap solution for those cases.

So:

if not name.islower():
    if name not in self.engine.table_names(schema=schema or self.meta.schema):
            warnings.warn(...)

@jreback
Copy link
Contributor

jreback commented Apr 12, 2016

I am a bit -1 on having a very specific kw like this. Is there a way to NOT check, then fallback somehow (with a check if the tables are not found)?

@RogerThomas
Copy link
Contributor Author

@jorisvandenbossche that's a VERY simple and cheap solution and would fix the issues I'm having. I should've thought of it... Nice one

@jorisvandenbossche
Copy link
Member

The problem with something like force_lower_case is that this is not backwards compatible + for other databases case sensitivity can be much more sane to work with, so this would be too generic.

@jorisvandenbossche
Copy link
Member

@RogerThomas Want to do a PR for that one?
We can always further discuss other solutions.

@RogerThomas
Copy link
Contributor Author

@jreback @jorisvandenbossche, sure I take both your points. @jorisvandenbossche sure I'll put in a PR for that.

Thanks :)

@jreback jreback added the Performance Memory or execution speed performance label Apr 12, 2016
@jreback jreback added this to the 0.18.1 milestone Apr 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants