executemany #491

Closed
cp2587 opened this Issue Dec 2, 2016 · 17 comments

cp2587 commented Dec 2, 2016

Hello,

I am using psycopg2 and I noticed that when doing executemany() with an INSERT statement, the library only iterates through the list and sends single INSERT statements. This is really slow compared to a multi-values INSERT.
What is the recommended way of building a multi-row INSERT ... VALUES ... statement using psycopg2? (My understanding was that executemany() is built for that, but obviously I am wrong.)

EDIT: I just found a topic related to this problem here: https://www.postgresql.org/message-id/CA+mi_8Zeeg93nHDXiZo0U16y-ygYN+FSCxD34nsmkTjkZqqfZA@mail.gmail.com.

Maybe you could explain a bit how executemany works in your documentation, like here: http://initd.org/psycopg/docs/faq.html#best-practices ?

@fogzot fogzot added the enhancement label Dec 2, 2016

fogzot commented Dec 2, 2016

You're right, but that optimization was never implemented, in part because you need an SQL parser to understand when it is possible to switch to a single INSERT, and neither Daniele nor I like the idea of having a full SQL parser inside psycopg.

I'll leave this issue open as "enhancement", to track ideas about how to implement this in a backward-compatible way.

zzzeek commented Dec 2, 2016

It's also curious that we timed inserting 80K six-integer rows into an empty, unindexed table over a moderate network and it takes 3-5 minutes, whereas the identical operation against a MySQL database on the same server as the PG one takes about 5 seconds, with either the native or the pure Python driver - and MySQL doesn't support the "VALUES" approach either. How is libpq so much slower overall?

dvarrazzo commented Dec 2, 2016

The currently available fast way to load data into Postgres is to use COPY. This, however, doesn't play well with Python types, as psycopg adaptation is not invoked.
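
For illustration, a minimal COPY sketch using copy_from() (the table name and connection string here are assumptions); note that the rows have to be serialized to the Postgres text format by hand, which is exactly the adaptation gap mentioned above:

import io
import psycopg2

conn = psycopg2.connect("dbname=test")  # connection string is an assumption
cur = conn.cursor()

rows = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
# COPY bypasses psycopg's adaptation: we render the text format ourselves
# (tab-separated columns, \N for NULL, backslashes escaped, and so on)
buf = io.StringIO("".join("%d\t%s\n" % row for row in rows))
cur.copy_from(buf, 'testint', columns=('id', 'data'))
conn.commit()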

Something else that can be done is to prepare the insert: a proof of concept of a preparing cursor is in this gist.
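
And a minimal sketch of the prepare-then-execute idea using plain SQL PREPARE/EXECUTE through an ordinary cursor (this is not the gist's preparing-cursor code; cur is a psycopg2 cursor and the table name is an assumption):

cur.execute("PREPARE ins (int, text) AS INSERT INTO testint VALUES ($1, $2)")
for row in [(1, 'foo'), (2, 'bar'), (3, 'baz')]:
    # each EXECUTE is still one roundtrip, but the statement is parsed
    # and planned only once on the server
    cur.execute("EXECUTE ins (%s, %s)", row)
cur.execute("DEALLOCATE ins")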

Both of the above could be implemented in psycopg if we taught it a different adaptation method (to Postgres format, without the quotes) and used the more up-to-date communication protocol with the server: these are ideas for psycopg3.

To stick to the present (psycopg3 is not currently in development), I think the fastest way to insert several records would be something along the lines of:

  • batching your data into reasonably-sized lists of tuples
  • using a form like insert into table select * from unnest(array)

It takes a trip into some obscure SQL and Postgres data model corners to get it to work, but this sort of does:

in psql:

=> create table testint (id integer, data text);
CREATE TABLE

in python:

data = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
cur.execute("insert into testint select * from unnest(%s) as (id int, data unknown)", [data])

Omitting the as (...) part will result in an error such as: a column definition list is required for functions returning "record". Using a more intuitive data text will result in the error: function return row and query-specified return row do not match (returned type unknown at ordinal position 2, but query expects text). I wonder if Postgres could be smarter with this cast...

This solution works as expected and allows calling execute() less often (fewer network roundtrips), with several records inserted per query and customized Python-to-Postgres adaptation. It should be slower than COPY but faster than executemany(), with a convenience close to the latter.

pmart123 commented Dec 27, 2016

Previously, I've handled bulk inserts by chunking an iterable into chunks of length X and building a template string with the same number of inputs, as in line 38, insert_many. unnest seems cleaner in some ways, but I believe it has similar performance (both are roughly 7x slower than COPY, from my recollection). @zzzeek Either is much faster than single inserts though, and upserts have no way to get around single insertion that I know of? One of the more frustrating aspects I found with Postgres's COPY was passing the nullable vs. non-nullable column specification to build the COPY statement. Any chance/interest in SQLAlchemy implementing a higher-level COPY that does this by default?
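
For reference, a minimal sketch of that chunk-and-template approach (the function, table, and column names are made up for illustration):

def insert_many(cur, rows, chunk_size=1000):
    # insert rows in chunks, one multi-values INSERT per chunk
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # build "(%s, %s), (%s, %s), ..." with one placeholder group per row
        template = ", ".join(["(%s, %s)"] * len(chunk))
        args = [value for row in chunk for value in row]  # flatten the tuples
        cur.execute("INSERT INTO testint (id, data) VALUES " + template, args)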

zzzeek commented Dec 28, 2016

@pmart123 SQLAlchemy's SQL API is extensible and something odd like COPY should likely be a third-party extension.

dvarrazzo commented Dec 30, 2016

@pmart123 I don't know if you are aware of it, but in a discussion on the ML I floated the idea of reimplementing executemany() with the equivalent of execute(";".join(coalesce(sql, args))). The implementation of the idea is in this gist.

Quick testing, only locally and not over a network, shows about a 100% speedup. I'd like to know what you (and @zzzeek, @fogzot, and @cp2587) think about the idea. If it looks viable I can quickly implement it and release it in psycopg 2.7.
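
For concreteness, roughly what the batching idea amounts to (a sketch of the idea, not the gist's actual code):

def execute_batched(cur, sql, argslist, batch_size=100):
    # adapt each statement with mogrify() and send them joined by semicolons,
    # so each batch costs a single roundtrip
    argslist = list(argslist)
    for start in range(0, len(argslist), batch_size):
        batch = argslist[start:start + batch_size]
        cur.execute(b";".join(cur.mogrify(sql, args) for args in batch))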

zzzeek commented Jan 1, 2017

@dvarrazzo 100% speedup without network is great, if you throw a network in there it will hopefully be much more dramatic.

Do we have some idea whether multiple INSERTs separated by ";" are similar, performance-wise, to a single INSERT with multiple VALUES?

Things I'm thinking of:

  • if executemany() is used for UPDATE or DELETE, can we make sure we return an aggregate number for cursor.rowcount?
  • will the behavior be configurable?

dvarrazzo commented Jan 2, 2017

I expect an insert with multiple values would be somewhat faster than many inserts, but it would be "CPU-level faster", likely lost in the noise of the speedup that can be obtained by reducing the number of network roundtrips. However, joining statements by semicolons is the only thing psycopg can do without getting into the business of parsing the statement: if someone wanted to generate an insert with multiple rows, it should be done before hitting psycopg, for instance by calling mogrify() only on a (%s, %s, ... %s) template and then executing "insert into table values %s" % ",".join(list), using execute() instead of executemany().
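
For concreteness, a sketch of that mogrify-and-join approach (the table and column names are assumptions):

rows = [(1, 'foo'), (2, 'bar'), (3, 'baz')]
# adapt each row with mogrify() on a single-row template, then join the
# resulting literals into one multi-values INSERT run with execute()
values = b",".join(cur.mogrify("(%s, %s)", row) for row in rows)
cur.execute(b"INSERT INTO testint (id, data) VALUES " + values)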

I would make the behaviour configurable using an executemany parameter with the number of statements to batch together. Setting it to <= 1 would return to the previous behaviour, but if we don't see shortcomings in the feature I'd keep a "working" default, e.g. 100.

Aggregating cursor rowcount is an interesting idea. I don't think we currently do anything special with it, i.e. I think it would return the value for the last statement, and I don't think the batched behaviour would be different. So all in all it would be an independent feature. However the number returned by the server is not entirely reliable: for instance inserting into a partitioned table (where the statement is intercepted by a trigger and a row inserted in a different table) returns an "INSERT 0" result.

zzzeek commented Jan 3, 2017

For the rowcount, we are concerned with UPDATE and DELETE and not really INSERT - the usefulness of rowcount is that you can verify that you matched the number of rows you expected to match with your WHERE clause, which is an essential part of how to detect concurrent changes to a row. If I have an UPDATE statement that expects to match one row per query (because it uses the primary key), and I run it against ten sets of parameters, I'd ideally want rowcount to be ten. SQLAlchemy, when it is in "detect concurrent updates" mode, actually has to forgo using executemany() for those DBAPIs that don't aggregate this count.
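
As a sketch of that rowcount check (the table, columns, and values here are hypothetical):

account_id, expected_version, new_balance = 42, 7, 100.0  # placeholder values
cur.execute(
    "UPDATE accounts SET balance = %s, version = version + 1"
    " WHERE id = %s AND version = %s",
    (new_balance, account_id, expected_version))
if cur.rowcount != 1:
    # the WHERE clause matched nothing: the row was changed or deleted
    # by someone else since we read it
    raise RuntimeError("concurrent update detected")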

I see psycopg2, as of version 2.0.9, aggregating these counts together for executemany(), and it passes all the tests, which indicates it's likely already doing this for UPDATE / DELETE, so I would just want that behavior to be maintained.

dvarrazzo commented Jan 4, 2017

What you say sounds alright: I wasn't aware of this behaviour.

>>> cur.executemany("insert into test default values", range(10))
>>> cur.rowcount
10

I'm not sure it could be maintained when joining the statements :\ That's a point against it...

dvarrazzo commented Feb 1, 2017

I'm adding functions to the extras module to implement the ideas discussed here and in the ML thread, without changing the executemany() implementation.

@altaurog has reported the following speedups over a transatlantic connection:

  • cur.executemany(): 761.322767019 sec
  • execute_batch(): 14.6529989243 sec
  • execute_values(): 12.4037430286 sec
  • pgcopy: 2.85529208183 sec
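
For anyone landing here, typical usage of the new helpers looks roughly like this (the table name and connection string are assumptions; see the psycopg2.extras docs for the exact signatures):

import psycopg2
from psycopg2 import extras

conn = psycopg2.connect("dbname=test")  # connection string is an assumption
cur = conn.cursor()
rows = [(1, 'foo'), (2, 'bar'), (3, 'baz')]

# execute_batch: adapt the statements and send them in pages joined by ';'
extras.execute_batch(
    cur, "INSERT INTO testint (id, data) VALUES (%s, %s)", rows, page_size=100)

# execute_values: expand a single %s placeholder into a multi-row VALUES list
extras.execute_values(
    cur, "INSERT INTO testint (id, data) VALUES %s", rows, page_size=100)

conn.commit()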

zzzeek commented Feb 1, 2017

OK, so the plan is that vanilla executemany() stays the same, and I can add new SQLAlchemy flags so my users can get all these variants? That way I can also manipulate the "rowcount" compatibility flags predictably.

dvarrazzo commented Feb 1, 2017

@zzzeek correct: my main concern is the rowcount after executemany() changing too subtly, so let's leave it untouched. Of course the rowcount after calling these functions will not report the complete values.

dvarrazzo commented Feb 2, 2017

Branch merged to master. Here are the docs for the new functions. Closing the issue, but there is still space for some feedback.

@dvarrazzo dvarrazzo closed this Feb 2, 2017

altaurog commented Feb 4, 2017

I think the use of the word "prepare" in the documentation for execute() and executemany() is confusing, even with the warning. Perhaps it is superfluous?

dvarrazzo commented Feb 5, 2017

Possibly it is. These docs come straight from the DBAPI specs and don't reflect the psycopg implementation. I'll review them, thanks.

BU-AWolfe commented Apr 3, 2018

Just so you know... I worked many years at Oracle. Oracle set up "bulk binds" in its OCI interface for input and output parameters so that

  1. a single round-trip could work on thousands of individual rows
  2. the operation could not be subject to SQL injection, since it did not rely on string formatting

This was much faster than row-at-a-time operations with the same data.

PostgreSQL is, of course, quite a different beast, but this is a reason why implementing executemany as a single execute is a reasonable choice.

alanorth added a commit to ilri/dspace-statistics-api that referenced this issue Sep 26, 2018

indexer.py: Use psycopg2's execute_values to batch inserts
Batch inserts are much faster than a series of individual inserts
because they drastically reduce the overhead caused by round-trip
communication with the server. My tests in development confirm:

  - cursor.execute(): 19 seconds
  - execute_values(): 14 seconds

I'm currently only working with 4,500 rows, but I will experiment
with larger data sets, as well as larger batches. For example, on
the PostgreSQL mailing list a user reports doing 10,000 rows with
a page size of 100.

See: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
See: psycopg/psycopg2#491 (comment)