"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103

Closed
tripkane opened this Issue May 17, 2018 · 31 comments


tripkane commented May 17, 2018

# let's import the packages to use
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# create some example data
df = pd.DataFrame(np.arange(0, 20000, 1))
# create the engine to connect pandas with sqlite3 (in-memory database)
engine = create_engine('sqlite://')
# create a connection
conn = engine.connect()
# write df to a sql table (this is where the error below is raised on 0.23)
df.to_sql('test', engine, if_exists='replace', chunksize=1000)
# print the results
result = conn.execute("select * from test")
for row in result:
    print(row['index'])
conn.close()

Problem description

In pandas 0.22 I could write a dataframe of reasonable size to sql without error. Now I receive the error "OperationalError: (sqlite3.OperationalError) too many SQL variables" when converting a dataframe with ~20k+ rows to sql. After looking around, I suspect the problem lies in the limit set by sqlite3, SQLITE_MAX_VARIABLE_NUMBER, which is 999 by default (based on their docs) and can apparently only be changed by recompiling sqlite with a different value. I can confirm that this works for a df of length (rows) 499, and that the test version above works with a row length of 20k when a chunksize of 499 is passed to to_sql. In my real case the limit is 76. These numbers clearly depend on the data size of each row, so a method is required to estimate the chunksize based on data type and number of columns. #19664
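As an interim workaround, a minimal sketch based on the numbers above: passing an explicit chunksize that keeps each INSERT under the 999-variable limit avoids the error (499 rows of index + one column is 998 bound variables, just under the limit).

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.arange(0, 20000, 1))
engine = create_engine('sqlite://')

# The multi-values INSERT uses one bound variable per cell (including the
# index column), so rows-per-chunk * variables-per-row must stay below 999.
df.to_sql('test', engine, if_exists='replace', chunksize=499)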

Expected Output

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Traceback

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
Traceback (most recent call last):

File "", line 1, in
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "H:/Tests/Pandas_0.23_test.py", line 19, in
df.to_sql('test',engine, if_exists='fail',chunksize=500)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2127, in to_sql
dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 450, in to_sql
chunksize=chunksize, dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1149, in to_sql
table.insert(chunksize)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 638, in _execute_insert
conn.execute(*self.insert_statement(data, conn))

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 948, in execute
return meth(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 269, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1060, in _execute_clauseelement
compiled_sql, distilled_params

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1200, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1413, in _handle_dbapi_exception
exc_info

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 186, in reraise
raise value.with_traceback(tb)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1193, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 507, in do_execute
cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO test ("index", "0") VALUES (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?,

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

kristang commented May 17, 2018

This is also an issue when connecting to MSSQL. It fails silently and will not stop the function or return an exception, but will throw one if you set "chunksize" manually.

tripkane commented May 17, 2018

@kristang: I just played around a bit more. Have you tried a chunksize limit of 75 or less? My case works for the local db with that chunksize, and strangely with 499 for an in-memory sqlite db, neither of which is the limit stated by sqlite. Have you tried different chunksizes in your case?

kristang commented May 17, 2018

I tried 999, 1000 and 2100, but then just ended up downgrading to 0.22 to get on with my day.

Contributor

TomAugspurger commented May 17, 2018

@tripkane could I bother you to narrow down the original example a bit? It should be minimal, and something that's appropriate for sticking in a unit test.

Could you also include the traceback you get?

tripkane commented May 17, 2018

@kristang: :) Fair enough, maybe try these lower numbers out when you have time.

tripkane commented May 17, 2018

@TomAugspurger: Hi Tom. Yep, can do. So reduce it to my actual case (i.e. a locally saved database, not in memory, and create the data with np.arange like the :memory: version)?

Contributor

TomAugspurger commented May 17, 2018

tripkane commented May 17, 2018

@TomAugspurger: I have shortened it and updated the comments. Basically chunksize must be used, and one needs to estimate it based on the data size of one row for each individual case. This person implemented that kind of solution with peewee: https://stackoverflow.com/questions/35616602/peewee-operationalerror-too-many-sql-variables-on-upsert-of-only-150-rows-8-c Anyway, I hope this helps, and for those hitting this issue you can't go wrong in the meantime with a very small chunksize like 1 :)

tripkane commented May 17, 2018

@TomAugspurger: After further testing, the initial assumption seems correct: it comes down to the SQLITE_MAX_VARIABLE_NUMBER limit of 999. The maximum allowable chunksize for df.to_sql is given by chunksize = 999 // (cols + 1), where the +1 accounts for the index column. For the test, cols was generated by cols = np.random.randint(1, 20) and used to create df = pd.DataFrame(np.random.random((20000, cols))).
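A short sketch of that relationship, set up the same way as the test described above (the 999 value is SQLite's default compile-time limit; the +1 is for the index column that to_sql writes by default):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

SQLITE_MAX_VARIABLE_NUMBER = 999  # default compile-time limit in SQLite

cols = np.random.randint(1, 20)
df = pd.DataFrame(np.random.random((20000, cols)))

# One bound variable per data column plus one for the index, per row
max_chunksize = SQLITE_MAX_VARIABLE_NUMBER // (cols + 1)

engine = create_engine('sqlite://')
df.to_sql('test', engine, if_exists='replace', chunksize=max_chunksize)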

Member

jorisvandenbossche commented May 17, 2018

@tripkane thanks for the minimal example! I can confirm this regression.

tripkane commented May 17, 2018

@jorisvandenbossche: You're welcome, happy to help.

Contributor

schettino72 commented May 18, 2018

Since you guys are looking into this... can you check whether the change that caused this regression actually improved performance?

As I reported on #8953, this change actually made the operation slower for me, and IMO it should be reverted...

Contributor

TomAugspurger commented May 18, 2018

tripkane commented May 18, 2018

@TomAugspurger: I tested this quickly with essentially the same example as above for two dataframe lengths, 20k and 200k rows, comparing chunksize=1 with chunksize=999//(cols+1) for cols = 10. In both cases the latter was about twice as fast when connected to a local sqlite db.
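A rough sketch of the kind of timing comparison described above (assuming a local SQLite database file; the file name is illustrative and exact numbers will vary):

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

cols = 10
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite:///timing_test.db')  # hypothetical local db file

for chunksize in (1, 999 // (cols + 1)):
    start = time.perf_counter()
    df.to_sql('test', engine, if_exists='replace', chunksize=chunksize)
    print('chunksize={}: {:.2f}s'.format(chunksize, time.perf_counter() - start))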

Contributor

schettino72 commented May 21, 2018

@TomAugspurger I created issue #21146 to track this performance regression.
I did not profile it but did some experiments... my conclusion is that performance on 0.23 is better for a few columns but degrades quickly as the number of columns grows.

Member

cpcloud commented May 22, 2018

@pandas-dev/pandas-core How do people feel about reverting #19664? It both breaks existing code and degrades performance significantly. Happy to put up the patch.

Contributor

TomAugspurger commented May 22, 2018

Contributor

TomAugspurger commented May 22, 2018

Member

jorisvandenbossche commented May 22, 2018

In the original issue (#8953), it was discussed to have an optional keyword for this, as the best default depends a lot on the exact circumstances (this discussion was largely ignored in the PR; sorry about not raising this in the PR).
So I think making it an option, with the default as it was before, is certainly a possible way forward.

I think it would still be useful for someone with some time to go through the discussion in that issue to check the use cases raised, how it interacts with chunksize, etc., as I don't recall everything that was discussed there.

Member

cpcloud commented May 22, 2018

IMO pandas should allow more flexibility here: users pass in a function that will get called with the SQLAlchemy table and they can construct the DML themselves. The default DML construction function would be the previous working code. Then any such optimization that comes up in the future doesn't need yet another keyword argument. I wonder if there isn't already a way to do this without monkey patching.

Contributor

schettino72 commented May 24, 2018

@cpcloud As of today there is no way to achieve this without monkey patching.
I use a monkey-patched version to use COPY ... FROM.

I could work on a patch to make this more flexible. I suggest adding a parameter insert_method to to_sql(). Apart from the default (previous) implementation, we could add multi-value (the 0.23 version) and COPY FROM.

schettino72 added a commit to schettino72/pandas that referenced this issue May 25, 2018

ENH: 'to_sql()' add param 'method' to control insert statement (pandas-dev#21103)

Also revert default insert method to NOT use multi-value.

Member

jorisvandenbossche commented May 28, 2018

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

I like the proposed idea of allowing more flexibility. But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

Contributor

schettino72 commented May 29, 2018

@jorisvandenbossche

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Also you would need to be careful with subclassing the table class as there are 2 implementations: SQLAlchemy and SQLite.

Another problem with monkeypatching/subclassing (as opposed to a parameter) is that the same method has to be used globally by an application and cannot be fine-tuned for each call of to_sql().

But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

  • a reference to table object
  • a connection
  • list of keys/columns
  • and data_iter with values

That was enough for me to implement postgresql COPY without having to repeat Pandas specific processing of columns and conversion of values.
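For illustration, a sketch of what such a custom insert function could look like against the interface listed above (table object, connection, keys, data_iter), using PostgreSQL's COPY via psycopg2. This is only an example under assumptions, not the patched pandas code: it presumes a SQLAlchemy connection wrapping psycopg2 and a table object exposing .name and .schema, and the function name is illustrative.

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Raw DBAPI (psycopg2) connection underneath the SQLAlchemy connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        # Dump the chunk to an in-memory CSV buffer
        buf = StringIO()
        writer = csv.writer(buf)
        writer.writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=buf)

With the parameter being discussed here, such a function could then be selected per call, e.g. df.to_sql('test', engine, method=psql_insert_copy).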

Member

jorisvandenbossche commented May 30, 2018

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Yeah, for to_sql that is true, but in principle you can subclass and then use the methods of that class. With emphasis on "can": as I said in my previous comment, I don't really recommend it either.

Anyway, let's leave that discussion as I think we agree that we need some way to customize it (at least to switch between the single / multi values to solve the regression).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

Would there be occasions where a user would want the actual dataframe instead of data_iter?

@cpcloud what do you think of this interface?

@schettino72 One option (for the PR) would also be to implement this parameter, but leave the postgres COPY one as an example of how to use it (so as not to add an option that only works for a single engine for now; although given its superior speed, we should really think about how to deal with this).

Contributor

danfrankj commented May 31, 2018

Hi all! My apologies for the regression introduced. I like the solution in #21199 much better.

For context, the motivation for needing a multivalues insert was to interact with a Presto database. In that case, a normal insert occurs row by row, each row incurring a network call, which makes it prohibitively slow. This may be the case for other analytic databases as well (Redshift, BigQuery).

But we definitely should not have this be the default, so thank you @schettino72 for cleaning this up, and again, apologies.

Contributor

danfrankj commented May 31, 2018

Just verified that multivalues has this behavior for Redshift as well.

In [60]: df = pd.DataFrame({'foo': np.arange(100)})

In [61]: conn.dialect
Out[61]: <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 at 0x11d8e9748>

In [62]: conn.dialect.supports_multivalues_insert = False

In [63]: %time df.to_sql('test_no_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 21.4 ms, sys: 11.6 ms, total: 32.9 ms
Wall time: 38.7 s

In [64]: conn.dialect.supports_multivalues_insert = True

In [65]: %time df.to_sql('test_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 7.88 ms, sys: 2.38 ms, total: 10.3 ms
Wall time: 4.26 s

Member

jorisvandenbossche commented Jun 6, 2018

@schettino72 do you have time to work on the PR?
Another short-term option for 0.23.1 is to revert the original change and target the PR for 0.24.0.

Contributor

schettino72 commented Jun 6, 2018

@schettino72 do you have time to work on the PR?

Sure. I was waiting for more feedback on the API. From my side, I guess I just need to update the docstrings and docs...

Should I target 0.24 or 0.23.1?

Contributor

TomAugspurger commented Jun 6, 2018

Member

jorisvandenbossche commented Jun 6, 2018

Yes, we certainly want to do something for 0.23.1 (at a minimum revert). But it would be nice to also get the option in there.
I am just not fully sure about the function interface (in case the exact API of the passed function needs more discussion). Maybe we could limit it for 0.23.1 to adding method='default'/'multi', and leave the ability to pass a function for 0.24.0.
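For reference, a rough sketch of how the string form of the option being discussed would be used (this mirrors the method keyword that eventually shipped in 0.24; treat the exact names as provisional in the context of this thread):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.arange(0, 20000, 1))
engine = create_engine('sqlite://')

# Default: the pre-0.23 single-row INSERT behaviour
df.to_sql('test_default', engine, if_exists='replace')

# Opt in to the multi-values INSERT, keeping each chunk under
# SQLite's 999-bound-variable limit
df.to_sql('test_multi', engine, if_exists='replace', method='multi', chunksize=499)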

gnfrazier commented Sep 13, 2018

Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix.
