
proposal: database/sql: support a way to perform bulk actions #5171

Open
the42 opened this issue Mar 31, 2013 · 19 comments


@the42 the42 commented Mar 31, 2013

The current implementation of database/sql doesn't provide a way to insert multiple rows
at once without going to the wire on every call to

db.Exec

Other APIs either provide a general mechanism for bulk actions,

cf. SQLBulkCopy of ADO.NET 2.0 [1]

or let the caller specify how many statements should be grouped together,

cf. Batch Update of ADO.NET 2.0 DataAdapter Object [1]

or simply support binding an array to Exec, over which Exec iterates internally,
preventing excessive wire communication. [2]



[1] Codeproject, "Multiple Ways to do Multiple Inserts"
http://www.codeproject.com/Articles/25457/Multiple-Ways-to-do-Multiple-Inserts
[2] Python PEP 249 -- Python Database API Specification v2.0
http://www.python.org/dev/peps/pep-0249/#executemany

@bradfitz bradfitz commented Mar 31, 2013

Comment 1:

Labels changed: added priority-later, suggested, removed priority-triage.

Status changed to Accepted.


@rsc rsc commented Nov 27, 2013

Comment 2:

Labels changed: added go1.3maybe.


@rsc rsc commented Dec 4, 2013

Comment 3:

Labels changed: added release-none, removed go1.3maybe.


@rsc rsc commented Dec 4, 2013

Comment 4:

Labels changed: added repo-main.


@kardianos kardianos commented Aug 19, 2014

Comment 5:

Of note, doing this well would require new database/sql API for both the driver and the
front-end code.
Having implemented several protocols that include a bulk copy method, this needs to be
called out as different, as much of the control that is available in an INSERT statement
is exposed differently than in SQL.
I'm wary of any suggestions to bind to an array, as arrays are legitimate data types in
several RDBMSs. I'll be implementing a general interface shortly in the rdb front end.
Drop me a line if you'd like to discuss.

@the42 the42 added accepted Suggested labels Aug 19, 2014

@bmharper bmharper commented Jan 27, 2015

For what it's worth - I've just tried using the lib/pq driver's Copy functionality to do bulk loading, and although I can't comment on whether this would work for other DB drivers, it seems like a reasonable API.


@perillo perillo commented Mar 7, 2016

The standard syntax for multi-value INSERT is (from PostgreSQL documentation):

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

What about adding a Value type that can be passed as the args argument to Exec or Query?

As an example:

type Value []interface{} // defined in the sql package

var batch []Value
for i := 0; i < N; i++ {
    batch = append(batch, Value{1, 1.3, "x"})
}
db.Exec("INSERT INTO films (code, title, did, date_prod, kind) VALUES ?", batch)

This will not require any changes to the existing interface.


@kostya-sh kostya-sh commented Mar 7, 2016

This will require holding data for the whole batch in memory. Also, this proposal doesn't allow batching updates, if supported by the database.

FYI, there is an implementation of batch insert in github.com/lib/pq based on COPY.


@perillo perillo commented Mar 7, 2016

On Mon, Mar 7, 2016 at 4:17 PM, kostya-sh notifications@github.com wrote:

This will require holding data for the whole batch in memory.

Of course; this is required by the standard INSERT statement.

Also this proposal doesn't allow to batch updates if supported by database.

By batch, do you mean multiple SQL statements in the same query, separated
by semicolon?

I will be happy with just the support for multi-value INSERT, since it is
directly supported by the SQL standard (using the VALUES construct):

type Tuple []interface{}

type Values []Tuple

FYI there is implementation of batch insert in github.com/lib/pq based on COPY.

COPY is not standard, and the github.com/lib/pq seems (just looking at the
API) to store the whole batch in memory.


@kostya-sh kostya-sh commented Mar 7, 2016

By batch, do you mean multiple SQL statements in the same query, separated by semicolon?

Yes, though this would require supporting multiple return values.

COPY is not standard, and the github.com/lib/pq seems (just looking at the API) to store the whole batch in memory.

COPY is not standard indeed but it is fast and the driver doesn't hold the whole batch in memory. Have a look at the implementation at https://github.com/lib/pq/blob/master/copy.go

I agree it would be nice to have a generic batch API, but it is quite difficult to design a single API that allows drivers to choose the optimal method to implement batched operations. I think using the driver library directly is a reasonable compromise.

BTW, in PostgreSQL it is also possible to use the following SQL for bulk insert:

INSERT INTO mytable (col1, col2, col3) VALUES (unnest(?), unnest(?), unnest(?))

I haven't used it, though, and I don't know if the Go driver supports arrays.


@perillo perillo commented Mar 7, 2016

On Mon, Mar 7, 2016 at 5:47 PM, kostya-sh notifications@github.com wrote:

[...]

BTW, in postgresql it is also possible to use the following SQL for bulk
insert:

INSERT INTO mytable (col1, col2, col3) VALUES (unnest(?), unnest(?), unnest(?))

I haven't use it though and I don't know if Go driver supports arrays.

This is what I was talking about. And it is not PostgreSQL-specific, but
SQL standard.
It does not use an array, but the VALUES statement:
http://www.postgresql.org/docs/9.5/static/sql-values.html

In Go, it can be defined as, e.g.:

type Tuple []interface{} // since Row and Value are already defined

type Values []Tuple

This has the advantage that a Values value can be specified as a parameter
to Query or Exec without changing the sql package API.


@natemurthy natemurthy commented Apr 27, 2018

@perillo I tried your method:

type Value []interface{} // defined in the sql package

var batch []Value
for i := 0; i < N; i++ {
    batch = append(batch, Value{1, 1.3, "x"})
}
db.Exec("INSERT INTO films (code, title, did, date_prod, kind) VALUES ?", batch)

but it results in the error:

Error: Expected nil, but got: &errors.errorString{s:"sql: converting argument $1 type: unsupported type []Value, a slice of slice"}


@pimbrouwers pimbrouwers commented Sep 5, 2018

Totally beyond the scope of golang. You could always manually open a transaction, process all your inserts individually (don't worry, there's connection pooling), and commit the transaction. Avoiding the overhead of the implicit transaction on each iteration will be a huge win.

For accessing the bulk features of the various RDBMSs, like bcp in SQL Server for example, you can always save a CSV to disk and use exec to run the batch.


@kardianos kardianos commented Sep 5, 2018

@pimbrouwers I disagree. I think it would be great to create a standard bulk data interface. Yes, opening a transaction will increase the speed of many inserts. But bulk interfaces are also useful. I personally hate relying on native commands like bcp to be present on the host system.


@pimbrouwers pimbrouwers commented Sep 6, 2018

Unless the standard interface has platform-specific adapters, it won't work, because there isn't an adhered-to standard for bulk actions; for example, SQL Server uses BULK INSERT... and MySQL uses LOAD DATA INFILE....

As far as relying on the presence of bulk tools, I see nothing wrong with that. You physically cannot install SQL Server without bcp, which SQL Server relies upon internally for so many things. So the presence of SQL Server means bcp is also present.


@kardianos kardianos commented Sep 6, 2018

Yes, like query protocols, each would need a driver.

I run my app on Linux and MS SQL Server on Windows. The app won't have bcp installed. Often they're on different boxes, or the app runs in a bare container. Or the database is hosted as a service.

@ALTree ALTree added help wanted and removed Suggested labels Jul 1, 2019

@thda thda commented Jul 2, 2020

I think a way to batch values before sending them is needed. It would save network roundtrips
and allow the driver to parse the statements and transparently switch to bulk inserts.

For reference, JDBC provides three methods:

  • addBatch to push values
  • executeBatch to flush to the server
  • clearBatch to discard the statements without sending them.

Statements could implement 2 methods to mirror this functionality:

// add the values to a batch of statements
AddValues(args []driver.NamedValue) error
// initialize/clear the batch of values
ClearValues() error

Then calling an Exec/ExecContext would send the batched statements to the server.
I don't know if a context would be needed for AddValues here, as there probably should not be
round trips to the server when enqueuing Values.

muhlemmer added a commit to muhlemmer/go that referenced this issue Oct 17, 2020
Related to golang#5171

Change-Id: I46a6d12b46d3802a338e5733ca81e8a0fb2ae125

@muhlemmer muhlemmer commented Oct 17, 2020

I crafted a proposal in my own fork. I'm open to comments.

If we get to something definitive, I'll write the necessary unit tests and submit a CL. I'm also thinking of preparing a reference implementation on a fork of lib/pq, to be PR-ed once this gets into a release.

First, let's discuss whether the proposed API is useful.

I would like to note that I'm a PostgreSQL / lib/pq user. The proposal below is based on what I know about those, so the naming of types is perhaps biased towards PostgreSQL terminology. We can change it to anything more suitable if required.

Driver interface

In order to truly benefit from increased performance and fewer (or single) round trips, a bulk action should happen asynchronously. If that were not the case, a prepared statement would suffice anyway.

Currently lib/pq uses a prepared statement, but switches to asynchronous streaming in the implementation code. Callers need to call stmt.Exec(args) for every row of data. Exec may return an error that is related to this or an earlier call, due to buffering. In order to flush the buffer and terminate the COPY, one call to Exec() without args must be made. Although this is workable, it is a confusing interface. For instance, one can't reliably determine the amount of data sent or written to the database.

In order to properly describe and work with the asynchronous relationship between data, result, and error, I would like to propose the use of channels. The driver shall return a write-only data channel and a read-only result channel.

  • When the sender is terminating, data will be closed.
  • data may be a buffered channel; that's up to the driver.
  • When the driver encounters an error, or finishes flushing after data is closed, it sends exactly one result.
  • A result may have both Res and Err populated to indicate a partial write at the time of error.

To be discussed:

  • As far as I understand, not all drivers have Result support? I'm not sure how to handle that, other than to document this behavior.
  • Other optional interfaces use some kind of fallback and ErrSkip. In this case there is nothing real to fall back to, unless we are going to use a hard-coded INSERT query on a prepared statement and Exec for each row, just to simulate the batch behavior. I don't think we should do something like that. For instance, are all drivers / dialects compliant with the INSERT INTO mytable (col1, col2, col3) VALUES (?, ?, ?) syntax?

// CopyResult signals termination of a CopyIn.
// Both Res and Err may be populated in case of a partial write.
type CopyResult struct {
	Res Result
	Err error
}

// Copier is an optional interface that may be implemented by a Conn.
type Copier interface {

	// CopyIn sends a batch of data to the server in a single execution.
	// The driver may do this asynchronously.
	//
	// The data channel may be buffered.
	// The sender closes the data channel when all data is sent.
	//
	// Exactly one CopyResult must always be sent on the result channel:
	// either after all data is flushed, or after encountering an error.
	CopyIn(ctx context.Context, table string, columns ...string) (data chan<- []NamedValue, result <-chan CopyResult)
}

sql package interface

Here I went with an interface type, passed to CopyIn(). With batches, the data set might come from another source, like a file, decoder, or network. This avoids the need to load the whole data set into memory. Conveniently, io.EOF is used to signal the end of a batch.

If the Batch interface is accepted, I intend to provide a helper function that can turn [][]interface{} into a Batch.

// Batch is input to a CopyIn command.
type Batch interface {
	Table() string
	Columns() []string
	// Next returns a row of data in a batch.
	// It should return io.EOF when no more data is available.
	Next() ([]interface{}, error)
}

// ErrNotCopier is returned when the driver does not support batch operations.
var ErrNotCopier = errors.New("sql: driver does not support CopyIn")

// CopyIn sends the provided batch in a single, asynchronous operation.
//
// In case of an error, partial data may have been written to the database.
// If supported, a driver may return both Result and error to indicate the amount of rows written.
func (tx *Tx) CopyIn(ctx context.Context, batch Batch) (Result, error) {
	dc, release, err := tx.grabConn(ctx)
	if err != nil {
		return nil, err
	}
	return tx.db.copyDC(ctx, dc, release, batch)
}

The rest of the copyDC implementation can be found here. It compiles, but I haven't tested anything yet, pending some feedback here.

@odeke-em odeke-em changed the title database/sql: should support a way to perform bulk actions proposal: database/sql: support a way to perform bulk actions Mar 10, 2021

@odeke-em odeke-em commented Mar 10, 2021

Cool, thank you @muhlemmer for the proposal ideas. This issue is very old and I didn't want to edit the original issue, but given your elaborate post in #5171 (comment), perhaps let's reuse this issue as is; I'm kindly putting it on the radar of the @golang/proposal-review team to skip to the linked comment.

@odeke-em odeke-em removed this from the Unplanned milestone Mar 10, 2021
@odeke-em odeke-em added this to the Proposal milestone Mar 10, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals Mar 10, 2021