Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

RogerThomas · 2016-04-12T09:55:44Z

This is related to #7815

Since this fix, when checking for case sensitivity issues for MySQL using InnoDB engine with large numbers of tables, Class SQLDatabase.to_sql can take a long time (I currently have ~30,000 tables and this call takes ~10 seconds).
This is due to the 'SHOW FULL TABLES' call coming from SQLAlchemy, from these lines of code specifically

engine = self.connectable.engine
with self.connectable.connect() as conn:
    table_names = engine.table_names(
        schema=schema or self.meta.schema,
        connection=conn,
     )

A suggested fix would be to add a parameter to the to_sql function.
The parameter might look something like, case_check='warn'
and the fix might be something like

if case_check == 'raise' and not name.islower():
    raise Exception("table name must be lower case when case_check = 'raise'")
current_code()
if case_check == 'warn':
    engine = self.connectable.engine
    with self.connectable.connect() as conn:
        table_names = engine.table_names(
            schema=schema or self.meta.schema,
            connection=conn,
         )

Versions:

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 18.2
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: None
sphinx: 1.3.1
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.0
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.38.0

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2016-04-12T13:06:37Z

Thanks for reporting this, that is indeed a problem.
I think your proposal for an extra keyword is certainly acceptable (although I would keep it simpler and just have True/False, without the 'raise' option). The only problem with this is that it is also not that discoverable (most people who are bitten by a performance problem because of this will not always think of using this keyword without first digging in to find the reason).

RogerThomas · 2016-04-12T13:15:08Z

No problem, thanks for looking into it.
That's a good point, perhaps a parameter force_lower_case is True by default then something like

if force_lower_case and not name.islower():
    raise Exception("table name must be lower case when force_lower_case is true'")
current_code()
if not force_lower_case:
    engine = self.connectable.engine
    with self.connectable.connect() as conn:
        table_names = engine.table_names(
            schema=schema or self.meta.schema,
            connection=conn,
         )

jorisvandenbossche · 2016-04-12T13:18:12Z

Also just thinking, what we can do in any case (no backwards compat issues), is only doing this check (and thus retrieving all tables names) when the provided table name is not all lowercase. That's already a rather cheap solution for those cases.

So:

if not name.islower():
    if name not in self.engine.table_names(schema=schema or self.meta.schema):
            warnings.warn(...)

jreback · 2016-04-12T13:20:30Z

I am a bit -1 on having a very specific kw like this. Is there a way to NOT check, then fallback somehow (with a check if the tables are not found)?

RogerThomas · 2016-04-12T13:21:30Z

@jorisvandenbossche that's a VERY simple and cheap solution and would fix the issues I'm having. I should've thought of it... Nice one

jorisvandenbossche · 2016-04-12T13:21:37Z

The problem with something like force_lower_case is that this is not backwards compatible + for other databases case sensitivity can be much more sane to work with, so this would be too generic.

jorisvandenbossche · 2016-04-12T13:22:02Z

@RogerThomas Want to do a PR for that one?
We can always further discuss other solutions.

RogerThomas · 2016-04-12T13:22:29Z

@jreback @jorisvandenbossche, sure I take both your points. @jorisvandenbossche sure I'll put in a PR for that.

Thanks :)

jorisvandenbossche added the IO SQL to_sql, read_sql, read_sql_query label Apr 12, 2016

jreback added the Performance Memory or execution speed performance label Apr 12, 2016

RogerThomas mentioned this issue Apr 12, 2016

PERF: Only do case sensitive check when not lower case #12880

Closed

4 tasks

jreback added this to the 0.18.1 milestone Apr 13, 2016

jreback closed this as completed in 5a53f03 Apr 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

RogerThomas commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

RogerThomas commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

jreback commented Apr 12, 2016

RogerThomas commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

RogerThomas commented Apr 12, 2016

Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

Pandas IO to_sql extremely slow when checking for potentially case sensitivity issues #12876

Comments

RogerThomas commented Apr 12, 2016

INSTALLED VERSIONS

jorisvandenbossche commented Apr 12, 2016

RogerThomas commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

jreback commented Apr 12, 2016

RogerThomas commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

jorisvandenbossche commented Apr 12, 2016

RogerThomas commented Apr 12, 2016