"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103

Closed
tripkane opened this Issue May 17, 2018 · 31 comments


tripkane commented May 17, 2018

# let's import the packages to use
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

# create some example data
df = pd.DataFrame(np.arange(0, 20000, 1))
# create the engine to connect pandas with sqlite3 (in-memory database)
engine = create_engine('sqlite://')
# create a connection
conn = engine.connect()
# write df to a sql table (this is where the error below is raised on 0.23)
df.to_sql('test', engine, if_exists='replace', chunksize=1000)
# print the results
result = conn.execute("select * from test")
for row in result:
    print(row['index'])
conn.close()

Problem description

In pandas 0.22 I could write a dataframe of reasonable size to sql without error. Now I receive the error "OperationalError: (sqlite3.OperationalError) too many SQL variables" when converting a dataframe with ~20k+ rows to sql. After looking around, I suspect the problem lies in the limit set by sqlite3, SQLITE_MAX_VARIABLE_NUMBER, which is 999 by default (based on their docs) and can apparently only be changed by recompiling sqlite with a different value. I can confirm that this works for a df of length (rows) 499, and that the test version above works with a row length of 20k when a chunksize of 499 is passed to to_sql. In my real case the limit is 76. These numbers clearly depend on the data size of each row, so a method is required to estimate the chunksize based on data type and number of columns. #19664
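As an interim workaround, a minimal sketch based on the numbers above: passing an explicit chunksize that keeps each INSERT under the 999-variable limit avoids the error (499 rows of index + one column is 998 bound variables, just under the limit).

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.arange(0, 20000, 1))
engine = create_engine('sqlite://')

# The multi-values INSERT uses one bound variable per cell (including the
# index column), so rows-per-chunk * variables-per-row must stay below 999.
df.to_sql('test', engine, if_exists='replace', chunksize=499)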

Expected Output

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Traceback

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
Traceback (most recent call last):

File "", line 1, in
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "H:/Tests/Pandas_0.23_test.py", line 19, in
df.to_sql('test',engine, if_exists='fail',chunksize=500)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2127, in to_sql
dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 450, in to_sql
chunksize=chunksize, dtype=dtype)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1149, in to_sql
table.insert(chunksize)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 638, in _execute_insert
conn.execute(*self.insert_statement(data, conn))

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 948, in execute
return meth(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 269, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1060, in _execute_clauseelement
compiled_sql, distilled_params

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1200, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1413, in _handle_dbapi_exception
exc_info

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 186, in reraise
raise value.with_traceback(tb)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1193, in _execute_context
context)

File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 507, in do_execute
cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO test ("index", "0") VALUES (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?,

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

kristang commented May 17, 2018

This is also an issue when connecting to MSSQL. It fails silently and will not stop the function or return an exception, but will throw one if you set "chunksize" manually.

tripkane commented May 17, 2018

@kristang: I just played around a bit more. Have you tried a chunksize limit of 75 or less? My case works for the local db with that chunksize, and strangely with 499 for an in-memory sqlite db, neither of which is the limit stated by sqlite. Have you tried different chunksizes in your case?

kristang commented May 17, 2018

I tried 999, 1000 and 2100, but then just ended up downgrading to 0.22 to get on with my day.

Contributor

TomAugspurger commented May 17, 2018

@tripkane could I bother you to narrow down the original example a bit? It should be minimal, and something that's appropriate for sticking in a unit test.

Could you also include the traceback you get?

tripkane commented May 17, 2018

@kristang: :) Fair enough, maybe try these lower numbers out when you have time.

tripkane commented May 17, 2018

@TomAugspurger: Hi Tom. Yep, can do. So reduce it to my actual case (i.e. a locally saved database, not in memory, and create the data with np.arange like the :memory: version)?

Contributor

TomAugspurger commented May 17, 2018

tripkane commented May 17, 2018

@TomAugspurger: I have shortened it and updated the comments. Basically chunksize must be used, and one needs to estimate it based on the data size of one row for each individual case. This person implemented that kind of solution with peewee: https://stackoverflow.com/questions/35616602/peewee-operationalerror-too-many-sql-variables-on-upsert-of-only-150-rows-8-c Anyway, I hope this helps, and for those hitting this issue you can't go wrong in the meantime with a very small chunksize like 1 :)

tripkane commented May 17, 2018

@TomAugspurger: After further testing, the initial assumption seems correct: it comes down to the SQLITE_MAX_VARIABLE_NUMBER limit of 999. The maximum allowable chunksize for df.to_sql is given by chunksize = 999 // (cols + 1), where the +1 accounts for the index column. For the test, cols was generated by cols = np.random.randint(1, 20) and used to create df = pd.DataFrame(np.random.random((20000, cols))).
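A short sketch of that relationship, set up the same way as the test described above (the 999 value is SQLite's default compile-time limit; the +1 is for the index column that to_sql writes by default):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

SQLITE_MAX_VARIABLE_NUMBER = 999  # default compile-time limit in SQLite

cols = np.random.randint(1, 20)
df = pd.DataFrame(np.random.random((20000, cols)))

# One bound variable per data column plus one for the index, per row
max_chunksize = SQLITE_MAX_VARIABLE_NUMBER // (cols + 1)

engine = create_engine('sqlite://')
df.to_sql('test', engine, if_exists='replace', chunksize=max_chunksize)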

Member

jorisvandenbossche commented May 17, 2018

@tripkane thanks for the minimal example! I can confirm this regression.

tripkane commented May 17, 2018

@jorisvandenbossche: You're welcome, happy to help.

Contributor

schettino72 commented May 18, 2018

Since you guys are looking into this... can you check whether the change that caused this regression actually improved performance?

As I reported on #8953, this change actually made the operation slower for me, and IMO it should be reverted...

Contributor

TomAugspurger commented May 18, 2018

tripkane commented May 18, 2018

@TomAugspurger: I tested this quickly with essentially the same example as above for two dataframe lengths, 20k and 200k rows, comparing chunksize=1 with chunksize=999//(cols+1) for cols = 10. In both cases the latter was about twice as fast when connected to a local sqlite db.
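A rough sketch of the kind of timing comparison described above (assuming a local SQLite database file; the file name is illustrative and exact numbers will vary):

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

cols = 10
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite:///timing_test.db')  # hypothetical local db file

for chunksize in (1, 999 // (cols + 1)):
    start = time.perf_counter()
    df.to_sql('test', engine, if_exists='replace', chunksize=chunksize)
    print('chunksize={}: {:.2f}s'.format(chunksize, time.perf_counter() - start))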

Contributor

schettino72 commented May 21, 2018

@TomAugspurger I created issue #21146 to track this performance regression.
I did not profile it but did some experiments... my conclusion is that performance on 0.23 is better for a few columns but degrades quickly as the number of columns grows.

Member

cpcloud commented May 22, 2018

@pandas-dev/pandas-core How do people feel about reverting #19664? It both breaks existing code and degrades performance significantly. Happy to put up the patch.

Contributor

TomAugspurger commented May 22, 2018

Contributor

TomAugspurger commented May 22, 2018

Member

jorisvandenbossche commented May 22, 2018

In the original issue (#8953), it was discussed to have an optional keyword for this, as the best default depends a lot on the exact circumstances (this discussion was largely ignored in the PR; sorry about not raising this in the PR).
So I think making it an option, with the default as it was before, is certainly a possible way forward.

I think it would still be useful for someone with some time to go through the discussion in that issue to check the use cases raised, how it interacts with chunksize, etc., as I don't recall everything that was discussed there.

Member

cpcloud commented May 22, 2018

IMO pandas should allow more flexibility here: users pass in a function that will get called with the SQLAlchemy table and they can construct the DML themselves. The default DML construction function would be the previous working code. Then any such optimization that comes up in the future doesn't need yet another keyword argument. I wonder if there isn't already a way to do this without monkey patching.

Contributor

schettino72 commented May 24, 2018

@cpcloud As of today there is no way to achieve this without monkey patching.
I use a monkey-patched version to use COPY ... FROM.

I could work on a patch to make this more flexible. I suggest adding a parameter insert_method to to_sql(). Apart from the default (previous) implementation, we could add multi-value (the 0.23 version) and COPY FROM.

schettino72 added a commit to schettino72/pandas that referenced this issue May 25, 2018

ENH: 'to_sql()' add param 'method' to control insert statement (pandas-dev#21103)

Also revert default insert method to NOT use multi-value.

Member

jorisvandenbossche commented May 28, 2018

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

I like the proposed idea of allowing more flexibility. But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

Contributor

schettino72 commented May 29, 2018

@jorisvandenbossche

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Also you would need to be careful with subclassing the table class as there are 2 implementations: SQLAlchemy and SQLite.

Another problem with monkeypatching/subclassing (as opposed to a parameter) is that the same method has to be used globally by an application and cannot be fine-tuned for each call of to_sql().

But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

  • a reference to table object
  • a connection
  • list of keys/columns
  • and data_iter with values

That was enough for me to implement postgresql COPY without having to repeat Pandas specific processing of columns and conversion of values.
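For illustration, a sketch of what such a custom insert function could look like against the interface listed above (table object, connection, keys, data_iter), using PostgreSQL's COPY via psycopg2. This is only an example under assumptions, not the patched pandas code: it presumes a SQLAlchemy connection wrapping psycopg2 and a table object exposing .name and .schema, and the function name is illustrative.

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Raw DBAPI (psycopg2) connection underneath the SQLAlchemy connection
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        # Dump the chunk to an in-memory CSV buffer
        buf = StringIO()
        writer = csv.writer(buf)
        writer.writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=buf)

With the parameter being discussed here, such a function could then be selected per call, e.g. df.to_sql('test', engine, method=psql_insert_copy).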

Member

jorisvandenbossche commented May 30, 2018

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Yeah, for to_sql that is true, but in principle you can subclass and then use the methods of that class. With emphasis on "can": as I said in my previous comment, I don't really recommend it either.

Anyway, let's leave that discussion as I think we agree that we need some way to customize it (at least to switch between the single / multi values to solve the regression).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

Would there be occasions where a user would want the actual dataframe instead of data_iter?

@cpcloud what do you think of this interface?

@schettino72 One option (for the PR) would also be to implement this parameter, but leave the postgres COPY one as an example of how to use it (so as not to add an option that only works for a single engine for now; although given its superior speed, we should really think about how to deal with this).

Contributor

danfrankj commented May 31, 2018

Hi all! My apologies for the regression introduced. I like the solution in #21199 much better.

For context, the motivation for needing a multivalues insert was to interact with a Presto database. In that case, a normal insert occurs row by row, each row incurring a network call, which makes it prohibitively slow. This may be the case for other analytic databases as well (Redshift, BigQuery).

But we definitely should not have this be the default, so thank you @schettino72 for cleaning this up, and again, apologies.

Contributor

danfrankj commented May 31, 2018

Just verified that multivalues has this behavior for Redshift as well.

In [60]: df = pd.DataFrame({'foo': np.arange(100)})

In [61]: conn.dialect
Out[61]: <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 at 0x11d8e9748>

In [62]: conn.dialect.supports_multivalues_insert = False

In [63]: %time df.to_sql('test_no_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 21.4 ms, sys: 11.6 ms, total: 32.9 ms
Wall time: 38.7 s

In [64]: conn.dialect.supports_multivalues_insert = True

In [65]: %time df.to_sql('test_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 7.88 ms, sys: 2.38 ms, total: 10.3 ms
Wall time: 4.26 s

Member

jorisvandenbossche commented Jun 6, 2018

@schettino72 do you have time to work on the PR?
Another short-term option for 0.23.1 is to revert the original change and target the PR for 0.24.0.

Contributor

schettino72 commented Jun 6, 2018

@schettino72 do you have time to work on the PR?

Sure. I was waiting for more feedback on the API. From my side, I guess I just need to update the docstrings and docs...

Should I target 0.24 or 0.23.1?

Contributor

TomAugspurger commented Jun 6, 2018

Member

jorisvandenbossche commented Jun 6, 2018

Yes, we certainly want to do something for 0.23.1 (at a minimum revert). But it would be nice to also get the option in there.
I am just not fully sure about the function interface (in case the exact API of the passed function needs more discussion). Maybe we could limit it for 0.23.1 to adding method='default'/'multi', and leave the ability to pass a function for 0.24.0.
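For reference, a rough sketch of how the string form of the option being discussed would be used (this mirrors the method keyword that eventually shipped in 0.24; treat the exact names as provisional in the context of this thread):

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.arange(0, 20000, 1))
engine = create_engine('sqlite://')

# Default: the pre-0.23 single-row INSERT behaviour
df.to_sql('test_default', engine, if_exists='replace')

# Opt in to the multi-values INSERT, keeping each chunk under
# SQLite's 999-bound-variable limit
df.to_sql('test_multi', engine, if_exists='replace', method='multi', chunksize=499)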

gnfrazier commented Sep 13, 2018

Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix.
